Abstract. In this paper, an online learning algorithm is proposed as sequential stochastic approximation
of a regularization path converging to the regression function in reproducing kernel
Hilbert spaces (RKHSs). We show that it is possible to produce the best known strong (RKHS 
norm) convergence rate of batch learning, through a careful choice of the gain or step size sequences, 
depending on regularity assumptions on the regression function. The corresponding weak (mean 
square distance) convergence rate is optimal in the sense that it reaches the minimax and individual 
lower rates in the literature. In both cases we deduce almost sure convergence, using Bernstein-type 
inequalities for martingales in Hilbert spaces. 

To achieve this we develop a bias-variance decomposition similar to the batch learning setting; 
the bias consists in the approximation and drift errors along the regularization path, which display 
the same rates of convergence, and the variance arises from the sample error analysed as a reverse 
martingale difference sequence. The rates above are obtained by an optimal trade-off between the 
bias and the variance. 



1. Introduction 

Consider the following classical problem of learning from examples: given a sequence of i.i.d.
random examples $(z_t = (x_t, y_t))_{t\in\mathbb N}$ drawn from a probability measure $\rho$ on $\mathcal X \times \mathcal Y$, one seeks to
approximate the regression function

$f_\rho(x) := \int_{\mathcal Y} y \, d\rho_{\mathcal Y|x},$

i.e. the conditional expectation of $y$ given $x$. Recall that $f_\rho$ minimizes the following mean square
error

(1) $\mathscr E(f) = \int_{\mathcal X\times\mathcal Y} (f(x) - y)^2 \, d\rho.$
The error of the approximation $f$ of $f_\rho$ is estimated for instance through the norm $\|f - f_\rho\|_\infty$ or
$\|f - f_\rho\|_\rho$, where

$\|f\|_\rho := \Big(\int_{\mathcal X} |f(x)|^2 \, d\rho_{\mathcal X}\Big)^{1/2}$

($\rho_{\mathcal X}$ being the marginal distribution of $\rho$ on $\mathcal X$), or through other norms in Hilbert spaces which,
as we shall see later, may capture different regularity features of this approximation.



An online learning algorithm aims at obtaining this approximation of the regression function
recursively, using at each time step the new example $z_t = (x_t, y_t)$ to update the current hypothesis
$f_{t-1}$ (approximating $f_\rho$) to $f_t$. In other words, $f_t = T_t(f_{t-1}, z_t)$ for some map $T_t : \mathscr H \times \mathcal Z \to \mathscr H$,
where $\mathscr H$ is a Hilbert space of functions from $\mathcal X$ to $\mathcal Y$; see for example [Smale and Yao 2006].

By contrast, batch learning algorithms process a sample set given once and for all at some
fixed time $m$, i.e. $\mathbf z = \{(x_i, y_i)\}_{i=1}^m$. The classical bias-variance paradigm is that of a trade-off
between the requirement to fit the data, i.e. to provide a small empirical error

$\frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2,$

and the size of the space in which $f$ can take place, in order to limit the impact of the noise
created by the data. For instance, a Tikhonov regularization (or Ridge Regression) procedure as
in [Evgeniou, Pontil, Poggio 2000] yields, given $\lambda > 0$,

$f_{\mathbf z,\lambda} := \arg\min_{f\in\mathscr H} \Big\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathscr H}^2 \Big\}.$

For more background on regularization of inverse problems, see for instance [Engl, Hanke, and Neubauer 1996].
In modern statistics, an $L_1$-type regularization called LASSO [Tibshirani 1996] is proposed in pursuit
of sparsity of $f_\rho$ with respect to a certain basis.

The regularization parameter $\lambda$ is chosen as a function of the sample size $m$, and of some prior
knowledge on the regularity of the function $f_\rho$. In this setting, probabilistic upper bounds of
$\|f_{\mathbf z,\lambda} - f_\rho\|_{\mathscr H}$ were obtained for instance in [Cucker and Smale 2002, Smale and Zhou 2005].

In online learning, the sample size $t$ is changing over time, so that the regularization parameter
needs to be updated at each time step, and follows the regularization path defined as follows. Let,
for all $\lambda > 0$, $f_\lambda$ be the solution of the regularized least square problem

(2) $f_\lambda = \arg\min_{f\in\mathscr H} \mathscr E(f) + \lambda \|f\|_{\mathscr H}^2.$

Depending on assumptions on the Hilbert space $\mathscr H$ and on the regularity of $f_\rho$, $f_\lambda$ converges to $f_\rho$
in $\mathscr L^2_{\rho_{\mathcal X}}$ or $\mathscr H$-norm when $\lambda \to 0$. The map

$f_\cdot : \mathbb R_+ \to \mathscr H, \quad \lambda \mapsto f_\lambda$

is called the regularization path of $f_\rho$ in $\mathscr H$.



Regularization paths have attracted growing attention in statistics, in particular in the LASSO case
[Efron, Johnstone, Hastie, and Tibshirani 2004], where they are piecewise linear with respect to
the parameter, which enables one to track the entire path at nearly the same computational
cost as a single fixed regularization. This property generalizes to the case where the loss
and the penalty are respectively piecewise quadratic and linear [Rosset and Zhu 2007]; note that
this however does not include Tikhonov regularization.

Our purpose in this paper is to iteratively define an "online" sequence of functions $(f_t)_{t\in\mathbb N} \subset \mathscr H$
which will provide a stochastic approximation of the Tikhonov regularization path $(f_{\lambda_t})_{t\in\mathbb N} \subset \mathscr H$.
With an adequate choice of the regularization parameters $\lambda_t \to 0$ based on a bias-variance trade-off,
we show such a sequential stochastic approximation to be optimal in the sense that it reaches
minimax and individual lower bound rates of convergence.

Our algorithms can be regarded as stochastic gradient descent algorithms to solve (2) with time
varying regularization parameter $\lambda_t$, an extension from early works [Smale and Yao 2006; Yao 2010]
which investigate the convergence $f_t \to f_\lambda$ for fixed $\lambda_t = \lambda > 0$. In that case a weak probabilistic
upper bound for $\|f_t - f_\lambda\|$ was first proposed in [Smale and Yao 2006], based on Markov's inequality.
Improved upper bounds were later obtained in [Yao 2010], leading in some cases to the same rate
of convergence of $(f_t)_{t\in\mathbb N}$ to $f_\lambda$ as in batch learning given $t$ examples.

However, as we shall see in this paper, time-varying $\lambda_t$ has not been addressed so far and leads to
a more complicated bias-variance decomposition, whose heuristics are related to the existence of
a phase transition in the convergence rate in stochastic approximation. We refer the reader to
[Duflo 1996; Kushner and Yin 2003] for background on stochastic algorithms.

As in previous studies, we choose in this paper the Hilbert space $\mathscr H$ to be a reproducing kernel
Hilbert space (RKHS) $\mathscr H_K$ for some kernel $K$. RKHS enables one to analyze nonparametric regressions
in a coordinate-free manner, and the gradient descent method then takes an especially simple
form [e.g. Kivinen, Smola, and Williamson 2004]. Moreover, RKHS provides a unified framework
in several important settings, e.g.

(i) generalized smooth spline functions in Sobolev spaces [Wahba 1990],

(ii) real analytic functions with bounded bandwidth [Daubechies 1992] and their generalizations
[Smale and Zhou 2004],

(iii) Gaussian processes [Loeve 1948; Parzen 1961].

In fact, any Hilbert space of functions on $\mathcal X$ with a bounded evaluation functional is an RKHS
[Wahba 1990]. By choosing suitable kernels, $\mathscr H_K$ can be used to approximate any function in $\mathscr L^2_{\rho_{\mathcal X}}$;
see for instance [Berlinet and Thomas-Agnan 2004] for wider background on RKHS.

Our analysis starts in the setting of a general Hilbert space $W$ in Section 3, with the study of
an iteratively defined sequence, which is a stochastic approximation of the solution of some linear
equation. This study will be specialized in later sections to the cases $W = \mathscr H_K$ or $\mathscr L^2_{\rho_{\mathcal X}}$ in order
to show the main results of the paper. Two structural decomposition theorems are introduced in
Section 3, the reversed martingale decomposition and the martingale decomposition, and play
an important role in the proof of the main results, the former being suitable for strong convergence
in $\mathscr H_K$ and the latter for weak convergence in $\mathscr L^2_{\rho_{\mathcal X}}$.

Both decompositions lead to the breakdown of the total error $f_t - f_\rho$ into four parts: the initial
error caused by the initial guess $f_0$, the sample error as a reverse martingale difference sequence,
the approximation error $f_{\lambda_t} - f_\rho$, and the drift error along the regularization path $(f_{\lambda_t})$ caused by
time-varying $\lambda_t$.
By a suitable choice of step sizes, the initial error won't affect the convergence rates. Now
a key observation is that the drift error, which does not appear in previous fixed regularization
settings with $\lambda_t = \lambda$, has the same order as the approximation error. Bernstein-type inequalities
for martingales in Hilbert spaces are then used to bound the sample error. Therefore we have a
similar bias-variance decomposition in online learning as in batch learning, with the bias being the
approximation and the drift errors, and the variance being the sample error. It is then possible to
optimize in order to yield the same optimal rates in online learning as in batch learning.

The main theorems in this paper provide some probabilistic upper bounds for the convergence
of $(f_t)_{t\in\mathbb N}$ to $f_\rho$, in $\mathscr H_K$ or $\mathscr L^2_{\rho_{\mathcal X}}$, under the assumption that $f_\rho \in \mathscr H_K$ has additional regularity.
The convergence rate in $\mathscr L^2_{\rho_{\mathcal X}}$ is optimal in the sense that it reaches the minimax and individual
lower rate. The convergence rate in $\mathscr H_K$ meets the same best rates as in batch learning
[Smale and Zhou 2005]. Both upper bounds depend on a logarithmic power $\alpha > 0$ of the confidence
threshold $\delta$ (i.e. $O(\log^\alpha(1/\delta))$). They imply by the Borel-Cantelli Lemma the almost sure convergence
of $f_t$ to $f_\rho$ in $\mathscr H_K$ and $\mathscr L^2_{\rho_{\mathcal X}}$. Such a theorem improves on our early result in 2006 (see [Yao 2006]),
where in mean square distance the upper bounds depended polynomially on the confidence (i.e.
$O(\delta^{-\alpha})$), and hence solves the open problem raised therein.

The paper is organized as follows. Section 2 collects the main results. Section 3 studies stochastic
approximations of regularization paths for linear operator equations in general Hilbert spaces, where
the key martingale and reverse martingale decompositions are presented. Section 4 collects some
estimates on the drift along the regularization path, $\|f_\lambda - f_\mu\|$ ($\lambda, \mu > 0$), which are needed for
the study of the bias, i.e. the approximation and drift errors. Sections 5 and 6 respectively yield
upper bounds for convergence in $\mathscr H_K$ and $\mathscr L^2_{\rho_{\mathcal X}}$. Appendix A derives a probabilistic inequality
from the Pinelis-Bernstein inequality for martingales in Hilbert spaces, which is used to derive the
probabilistic upper bounds in this paper. Appendix B collects some preliminary upper bounds used
in the paper. Appendix C gives proofs of some results in Section 3.2.

2. Main Results 

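2.1. Notations and Assumptions. Let $\mathcal X \subseteq \mathbb R^n$ be closed, $\mathcal Y = \mathbb R$ and $\mathcal Z = \mathcal X \times \mathcal Y$. Let $\rho$
be a probability measure on $\mathcal Z$, $\rho_{\mathcal X}$ be the induced marginal probability measure on $\mathcal X$, and let
$\rho_{\mathcal Y|x}$ be the conditional probability measure on $\mathcal Y$ with respect to $x \in \mathcal X$. Define $f_\rho : \mathcal X \to \mathcal Y$ by
$f_\rho(x) = \int_{\mathcal Y} y \, d\rho_{\mathcal Y|x}$, the regression function of $\rho$. In the sequel, we let $\mathbb E[\cdot]$ be the expectation with
respect to $\rho$.

Let $\mathscr L^2_{\rho_{\mathcal X}}$ be the Hilbert space of square integrable functions with respect to $\rho_{\mathcal X}$. In the sequel
$\|\cdot\|_\rho$ denotes the norm in $\mathscr L^2_{\rho_{\mathcal X}}$.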

Let $K : \mathcal X \times \mathcal X \to \mathbb R$ be a Mercer kernel (see Footnote 1), i.e. a continuous symmetric real function which is
positive semi-definite in the sense that $\sum_{i,j=1}^m c_i c_j K(x_i, x_j) \ge 0$ for any $m \in \mathbb N$ and any choice of
$x_i \in \mathcal X$ and $c_i \in \mathbb R$ ($i = 1, \ldots, m$). A Mercer kernel $K$ induces a function $K_x : \mathcal X \to \mathbb R$ ($x \in \mathcal X$)
defined by $K_x(x') = K(x, x')$. Let $\mathscr H_K$ be the reproducing kernel Hilbert space (RKHS) associated
with a Mercer kernel $K$, i.e. the completion of $\mathrm{span}\{K_x : x \in \mathcal X\}$ with respect to the inner
product defined as the linear extension of the bilinear form $\langle K_x, K_{x'} \rangle_K = K(x, x')$ ($x, x' \in \mathcal X$).
The norm of $\mathscr H_K$ is denoted by $\|\cdot\|_K$. The most important property of RKHS is the reproducing
property: for all $f \in \mathscr H_K$ and $x \in \mathcal X$, $f(x) = \langle f, K_x \rangle_K$.

Footnote 1. In computer science literature, one often bears in mind some implicit feature map $\Phi : \mathcal X \to \mathscr H$ which takes an
input vector $x$ to a high (or infinite) dimensional feature vector, say an element of a Hilbert space $\mathscr H$, and then one
considers explicitly the inner product $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$ as the kernel. In this construction, the continuity of
$K$ is equivalent to continuity of the feature map $\Phi$.
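The reproducing property can be checked numerically on finite kernel expansions. The sketch below is only an illustration under stated assumptions: the Gaussian kernel, the function names `evaluate` and `inner_product`, and the sample coefficients are hypothetical choices, not taken from the paper. For $f = \sum_i c_i K_{x_i}$ and $g = \sum_j d_j K_{x'_j}$ the inner product is $\sum_{i,j} c_i d_j K(x_i, x'_j)$, so taking $g = K_x$ should return $f(x)$.

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    # Assumed example of a Mercer kernel on the real line; K(x, x) = 1.
    return math.exp(-(x - xp) ** 2 / (2.0 * sigma ** 2))

def evaluate(coefs, centers, x, kernel=gaussian_kernel):
    # Pointwise value of f = sum_i coefs[i] * K_{centers[i]} at x.
    return sum(c * kernel(xi, x) for c, xi in zip(coefs, centers))

def inner_product(coefs_f, centers_f, coefs_g, centers_g, kernel=gaussian_kernel):
    # <f, g>_K for two finite kernel expansions, by bilinearity.
    return sum(c * d * kernel(xi, xj)
               for c, xi in zip(coefs_f, centers_f)
               for d, xj in zip(coefs_g, centers_g))

coefs, centers = [0.5, -1.2, 2.0], [0.0, 0.7, 1.5]
x = 0.3
f_x = evaluate(coefs, centers, x)                       # f(x)
reproduced = inner_product(coefs, centers, [1.0], [x])  # <f, K_x>_K
norm_f = math.sqrt(inner_product(coefs, centers, coefs, centers))
```

By Cauchy-Schwarz the same identity gives $|f(x)| \le \sqrt{K(x,x)}\, \|f\|_K$, which for this kernel reads $|f(x)| \le \|f\|_K$.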

Throughout this paper, assume that 

Finiteness Condition. (A) There exists a constant $\kappa > 0$ such that

$\kappa := \sup_{x\in\mathcal X} \sqrt{K(x, x)} < \infty.$

(B) There exists a constant $M_\rho > 0$ such that

$\mathrm{supp}(\rho) \subseteq \mathcal X \times [-M_\rho, M_\rho].$



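Define the linear map

$L_K : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr H_K$

by the following integral transform

$L_K(f)(x) := \int_{\mathcal X} K(x, t) f(t) \, d\rho_{\mathcal X}(t).$

It is well-known that $L_K$ is well-defined, and that composition with the inclusion $\mathscr H_K \hookrightarrow \mathscr L^2_{\rho_{\mathcal X}}$ yields
a compact positive self-adjoint operator on $\mathscr L^2_{\rho_{\mathcal X}}$ [e.g. Halmos and Sunder 1978, Cucker and Smale 2002].
The restriction $L_K|_{\mathscr H_K} : \mathscr H_K \to \mathscr H_K$ is the covariance operator of $\rho_{\mathcal X}$ in $\mathscr H_K$, by the reproducing
property.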

Abusing notation, we will denote the three operators by $L_K$ in the sequel.
Note that, by the Cauchy-Schwarz inequality, $\|L_K f\|_\infty \le \kappa^2 \|f\|_\rho$, so that

(3) $\|L_K\| \le \kappa^2.$

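The compactness of $L_K : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr L^2_{\rho_{\mathcal X}}$ implies the existence of an orthonormal eigensystem
$(\mu_\alpha, \phi_\alpha)_{\alpha\in\mathbb N}$ in $\mathscr L^2_{\rho_{\mathcal X}}$. Recall that (see [Cucker and Smale 2002] for instance)

$\sum_{\alpha\in\mathbb N} \mu_\alpha = \int_{\mathcal X} K(x, x) \, d\rho_{\mathcal X}(x) \le \kappa^2.$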

X 



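We assume in this paper that all eigenvalues are positive. We can define, for all $r > 0$,

(4) $L_K^r : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr L^2_{\rho_{\mathcal X}}, \quad \sum_{\alpha\in\mathbb N} a_\alpha \phi_\alpha \mapsto \sum_{\alpha\in\mathbb N} a_\alpha \mu_\alpha^r \phi_\alpha.$

$L_K^r$ can be regarded as a low-pass filter, and $\|L_K^r\| = \max_{\alpha\in\mathbb N} \mu_\alpha^r = \|L_K\|^r$. Note that
$L_K^{1/2} : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr H_K$ is an isometric isomorphism of Hilbert spaces. Hence the eigenfunctions $(\phi_\alpha)_{\alpha\in\mathbb N}$
are orthogonal both in $\mathscr L^2_{\rho_{\mathcal X}}$ and $\mathscr H_K$.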

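For all $f \in \mathscr L^2_{\rho_{\mathcal X}}$ and $r > 0$, we write $L_K^{-r} f \in \mathscr L^2_{\rho_{\mathcal X}}$ when $f$ lies in the image of the mapping
$L_K^r : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr L^2_{\rho_{\mathcal X}}$. Note that, if $r > 1/2$, then this implies $f \in \mathscr H_K$ because of the isometry $L_K^{1/2}$
between $\mathscr L^2_{\rho_{\mathcal X}}$ and $\mathscr H_K$.
For all $\lambda > 0$ and $r \in \mathbb R \setminus \{0\}$, we can similarly define $(L_K + \lambda I)^r : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr L^2_{\rho_{\mathcal X}}$, which is a
bijection, since $\sum_{\alpha\in\mathbb N} a_\alpha^2 < \infty$ is equivalent to $\sum_{\alpha\in\mathbb N} a_\alpha^2 (\lambda + \mu_\alpha)^{2r} < \infty$, using $\mu_\alpha \to 0$ as $\alpha \to \infty$.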

It can be shown [e.g. Cucker and Smale 2002] that for any $\lambda \in \mathbb R_+$, the solution of (2) is

(5) $f_\lambda = (L_K + \lambda I)^{-1} L_K f_\rho \in \mathscr H_K.$
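In the eigenbasis $(\phi_\alpha)$, formula (5) multiplies the coefficient of $f_\rho$ along $\phi_\alpha$ by $\mu_\alpha / (\mu_\alpha + \lambda)$. The sketch below illustrates this on a hypothetical finite spectrum; the eigenvalues `mu`, the coefficients `g` (playing the role of $L_K^{-r} f_\rho$) and the value of `r` are assumptions made for the example. It compares the approximation error $\|f_\lambda - f_\rho\|_\rho$ with the elementary bound $\lambda^r \|L_K^{-r} f_\rho\|_\rho$, which holds for $0 < r \le 1$ because $\mu^r \lambda^{1-r} \le \mu + \lambda$.

```python
import math

def regularized_coefficients(mu, a, lam):
    # Coefficients of f_lambda = (L_K + lam I)^{-1} L_K f_rho in the eigenbasis.
    return [m / (m + lam) * ai for m, ai in zip(mu, a)]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical spectrum and source condition, chosen only for illustration.
r = 0.75
mu = [1.0 / k ** 2 for k in range(1, 201)]   # eigenvalues of L_K
g = [1.0 / k for k in range(1, 201)]         # coefficients of L_K^{-r} f_rho
a = [m ** r * gi for m, gi in zip(mu, g)]    # coefficients of f_rho

def approximation_error(lam):
    # || f_lambda - f_rho ||_rho computed coefficient by coefficient.
    f_lam = regularized_coefficients(mu, a, lam)
    return l2_norm([x - y for x, y in zip(f_lam, a)])

errors = {lam: approximation_error(lam) for lam in (1e-1, 1e-2, 1e-3)}
bounds = {lam: lam ** r * l2_norm(g) for lam in errors}
```

The error shrinks as $\lambda$ decreases, which is the convergence $f_\lambda \to f_\rho$ along the regularization path in this toy setting.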

In this paper, by $B_1, C_1, D_1, B_2, C_2, D_2, \ldots$ we denote various constants, which are defined "locally"
in the sense that the same notation appearing in different sections has different meanings.

2.2. Stochastic Gradient Algorithms. Let $\mathcal F = (\mathcal F_t)_{t\in\mathbb N_0}$ be the filtration $\mathcal F_t =
\sigma\{(x_i, y_i) : 1 \le i \le t\}$. In the sequel denote by $\mathbb E_t = \mathbb E[\cdot \mid \mathcal F_t]$ the conditional expectation w.r.t. $\mathcal F_t$.
Consider the following $\mathcal F_t$-adapted process $(f_t)_{t\in\mathbb N}$ taking values in $\mathscr H_K$,

(6) $f_t = f_{t-1} - \gamma_t [(f_{t-1}(x_t) - y_t) K_{x_t} + \lambda_t f_{t-1}]$, for some fixed $f_0 \in \mathscr H_K$, e.g. $f_0 := 0$,
where

(I) for each $t$, $(x_t, y_t)$ is independent and identically distributed (i.i.d.) according to $\rho$;

(II) the gain (step size) sequence $(\gamma_t)_{t\in\mathbb N}$ and regularization sequence $(\lambda_t)_{t\in\mathbb N}$ take values in
$\mathbb R_+ := (0, \infty)$, and converge to $0$ as $t$ goes to infinity.

Remark 2.1. The computational cost of this algorithm typically is $O(t^2)$. At each step $t$, the main
computational cost is due to the evaluation $f_{t-1}(x_t)$, which needs to access all $K_{x_i}$ ($1 \le i < t$) in
$O(t)$ steps. Thus the total cost is $O(t^2)$ at time $t$. In cases where one can store and access
the values $f_t(x)$ for all $x$, e.g. on a grid of $\mathcal X$, the computational cost is merely linear, $O(t)$, at the
price of large memory and fast memory access.

By the reproducing property, we can see that the gradient map of

$V_z(f) = \frac{1}{2} [(f(x) - y)^2 + \lambda \|f\|_K^2], \quad z = (x, y) \in \mathcal Z,$

is given by $\mathrm{grad}\, V_z(f) = (f(x) - y) K_x + \lambda f$ [e.g. Smale and Yao 2006], as a random variable
depending on $z$. Since the expectation $\mathbb E[V_z(f)] = \frac{1}{2} (\mathscr E(f) + \lambda \|f\|_K^2)$, algorithm (6) can thus be
regarded as a stochastic approximation of the gradient descent method to solve (2), for each $\lambda = \lambda_t$.
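Iteration (6) can be sketched in a few lines by storing $f_t$ as a kernel expansion $f_t = \sum_{i \le t} c_i K_{x_i}$, starting from $f_0 = 0$. The following is a minimal illustration rather than the authors' implementation: the Gaussian kernel, the function names, and the way the sequences `gamma` and `lam` are passed in as callables are all assumptions made here. Each step shrinks the existing coefficients by $1 - \gamma_t \lambda_t$ and appends the new coefficient $-\gamma_t (f_{t-1}(x_t) - y_t)$.

```python
import math

def gaussian_kernel(x, xp, sigma=0.5):
    # Assumed example kernel; any Mercer kernel could be substituted.
    return math.exp(-(x - xp) ** 2 / (2.0 * sigma ** 2))

def online_regularized_sgd(samples, gamma, lam, kernel=gaussian_kernel):
    """Run iteration (6) from f_0 = 0.

    Returns (centers, coefs) such that f_t = sum_i coefs[i] * K_{centers[i]}.
    `gamma` and `lam` map the step index t >= 1 to gamma_t and lambda_t.
    """
    centers, coefs = [], []
    for t, (x_t, y_t) in enumerate(samples, start=1):
        g_t, l_t = gamma(t), lam(t)
        # Evaluate f_{t-1}(x_t); this touches every stored K_{x_i}.
        pred = sum(c * kernel(xi, x_t) for c, xi in zip(coefs, centers))
        # f_{t-1} - gamma_t * lambda_t * f_{t-1}: shrink the old coefficients.
        coefs = [(1.0 - g_t * l_t) * c for c in coefs]
        # -gamma_t * (f_{t-1}(x_t) - y_t) * K_{x_t}: one new expansion term.
        centers.append(x_t)
        coefs.append(-g_t * (pred - y_t))
    return centers, coefs

def predict(centers, coefs, x, kernel=gaussian_kernel):
    return sum(c * kernel(xi, x) for c, xi in zip(coefs, centers))

# Tiny usage example with constant sequences (illustrative values only).
demo_centers, demo_coefs = online_regularized_sgd(
    [(0.0, 1.0), (0.0, 3.0)], gamma=lambda t: 0.5, lam=lambda t: 0.2)
demo_value = predict(demo_centers, demo_coefs, 0.0)
```

The evaluation of `pred` costs one kernel call per stored example, so $t$ steps cost on the order of $t^2$ kernel evaluations, in line with Remark 2.1.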

2.3. Main Theorems. Theorem A provides sufficient conditions for the convergence of the online
learning sequence $(f_t)_{t\in\mathbb N}$ in (6) to the regression function $f_\rho$. Theorems B and C make explicit the
corresponding convergence rates, respectively in $\mathscr H_K$ and $\mathscr L^2_{\rho_{\mathcal X}}$.

Theorem A (Sufficient conditions for convergence). Assume $f_\rho \in \mathscr H_K$, and let $(f_t)$ be defined by
equation (6), with assumptions (I)-(II). Then

$\limsup_{t\to\infty} \mathbb E[\|f_t - f_\rho\|_K^2] = 0,$

if the following conditions are satisfied:

(A) $\sum_{t} \gamma_t \lambda_t = \infty$;

(B) $\limsup_{n\to\infty} \sum_{k=1}^n \gamma_k^2 \prod_{i=k+1}^n (1 - \gamma_i \lambda_i)^2 = 0$;

(C) $\limsup_{n\to\infty} \sum_{k=1}^n \|f_{\lambda_k} - f_{\lambda_{k-1}}\|_K \prod_{i=k+1}^n (1 - \gamma_i \lambda_i) = 0$.

This theorem will be proved in Section 3, as a consequence of Theorem 3.5 in the setting of
Hilbert spaces. Assumptions (B) and (C) can be replaced by the stronger (but less technical)
assumptions (B') and (C') in Corollary 3.7 that $\gamma_t / \lambda_t \to 0$ and $\|f_{\lambda_t} - f_{\lambda_{t-1}}\|_K / (\gamma_t \lambda_t) \to 0$.

Remark 2.2. Although $\lambda_t \to 0$, condition (A) imposes the restriction that $\lambda_t$ cannot drop too fast; in
fact this is necessary to "forget" the error caused by the initial guess $f_0$. Condition (B) says that the
step size $\gamma_t \to 0$, and that it has to drop faster than the regularization parameter $\lambda_t$. Such a condition
attenuates the random fluctuation caused by sampling. Condition (C) implies that the drifts
of the regularization path $(f_{\lambda_t})$ converge to zero, at a speed faster than $\gamma_t \lambda_t$. This condition says
that in the long run, the drifts along the regularization path should decrease fast enough for the
algorithm to follow the path. The drifts depend on the regularity of $f_\rho$: the smoother $f_\rho$ is, the
faster the drifts go down.
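A short calculation makes the role of condition (A) concrete. It is only a sketch, assuming $0 < \gamma_i \lambda_i < 1$ for all $i$ and using the elementary inequality $1 - x \le e^{-x}$; the product below is the contraction factor that multiplies whatever entered the recursion (6) at time $k$, in particular the initial guess when $k = 0$.

```latex
\[
\prod_{i=k+1}^{n} (1 - \gamma_i \lambda_i)
\;\le\; \exp\Big( - \sum_{i=k+1}^{n} \gamma_i \lambda_i \Big)
\;\xrightarrow[n \to \infty]{}\; 0
\qquad \text{whenever } \sum_{i} \gamma_i \lambda_i = \infty .
\]
% Conversely, if \sum_i \gamma_i \lambda_i < \infty, the infinite product has a
% strictly positive limit, so the contribution of f_0 is never fully forgotten.
% Example: for \gamma_i \lambda_i = 1/i (i \ge 2) the product telescopes,
\[
\prod_{i=k+1}^{n} \Big( 1 - \frac{1}{i} \Big) = \frac{k}{n}, \qquad 1 \le k \le n .
\]
```

In the example $\gamma_i \lambda_i = 1/i$, a perturbation introduced at time $k$ is therefore damped by exactly $k/n$ at time $n$.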

In the next two theorems (B) and (C) we choose the sequences $(\gamma_t)_{t\in\mathbb N}$ and $(\lambda_t)_{t\in\mathbb N}$ in order to
optimize the rates of convergence in $\mathscr H_K$ and $\mathscr L^2_{\rho_{\mathcal X}}$. This optimization is twofold.

First, the study of convergence of approximations of ordinary differential equations generically
yields a phase transition between a slower rate with "shadowing" of mean-field trajectories, and a
faster one, normally distributed after renormalization. Even though the picture is more complicated
in our case, in particular because the vector $f_t$ is infinite-dimensional, this justifies here that we
choose $\gamma_t \lambda_t$ reciprocally linear in $t$.

Second, optimization over $(\gamma_t)$ at fixed $(\gamma_t \lambda_t)$ yields a bias-variance trade-off similar to the one
observed in statistical "batch" learning, which relies on the regularity assumption on the regression
function $f_\rho$.

More precisely, let us first recall the phase transition in classical finite-dimensional stochastic
approximation, in the rate of convergence towards a stable equilibrium. Naturally, we study the
projections of the algorithm on the basis of eigenvectors of the linearization of the ordinary differential
equation at the equilibrium. Let $(\eta_t)_{t\in\mathbb N}$ be one of these projections, and assume for instance
that the corresponding eigenvalue is $-1$, so that the stochastic recursion is of the form

$\eta_{t+1} = \eta_t + \gamma_{t+1} (-\eta_t + \epsilon_{t+1} + r_{t+1}),$

where $\mathbb E_{t-1}[\epsilon_t] = 0$, $(\epsilon_t)$ is bounded, and $(r_t)$ is small. For simplicity we will assume that $r_t = 0$
(which corresponds to the special case where $\lambda_t = \lambda$ is a constant), but the heuristics extend to the
general case where $r_t$ is less than quadratic in all coordinates. Let, for all $t \in \mathbb N$, $\beta_t := \prod_{k=1}^t (1 - \gamma_k)$.
Then it is easy to show by induction that

$\eta_t = \beta_t \Big[ \eta_0 + \sum_{j=1}^t \frac{\gamma_j}{\beta_j} \epsilon_j \Big].$

Now suppose for instance that $\gamma_t \sim c/t$ ($c > 0$); then $\beta_n n^c \to C > 0$. Depending on the choice
of $c$, $\eta_t$ exhibits the following phase transition at $c = 1/2$ in its asymptotic dynamics.
• If $c < 1/2$ then $\sum_j (\gamma_j / \beta_j)^2 < \infty$, therefore $\sum_{j=1}^t \gamma_j \epsilon_j / \beta_j$ converges a.s. by Doob's convergence
theorem, which implies that $\eta_t t^c \to C$ as $t \to \infty$ (where $C$ is a positive random variable), in
other words that $(\eta_t)$ asymptotically "shadows" one solution of the ODE

$\frac{dx}{dt} = -c x.$

• On the contrary, if $c > 1/2$, then $\sum_j (\gamma_j / \beta_j)^2 = \infty$, and by the martingale convergence theorem
(see for instance [Williams 1991]), assuming for instance that $\mathbb E_{t-1}[\epsilon_t^2] = \sigma^2 > 0$ is constant,
$\eta_t \sqrt{t}$ converges towards a centered normally distributed random variable with variance
$\sigma^2 c^2 / (2c - 1)$, and follows an associated Ornstein-Uhlenbeck process, see [Duflo 1996] for
instance.

Therefore it suffices to choose $c > 1/2$ to achieve fast convergence rates. In this paper we will
set $c = 1$ and choose $\gamma_t \lambda_t \sim 1/t$ to meet the heuristics above.
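The two regimes can be observed numerically. The sketch below is an illustration under explicit assumptions: it takes $\gamma_k = c/k$ exactly with $0 < c < 1$ (so that no factor $1 - \gamma_k$ vanishes), and the function names are hypothetical. In that case $\beta_n = \prod_{k=1}^n (1 - c/k) = \Gamma(n + 1 - c) / (\Gamma(1 - c)\, \Gamma(n + 1))$, so $\beta_n n^c$ tends to $1/\Gamma(1 - c)$, and the terms $(\gamma_j / \beta_j)^2$ behave like a constant times $j^{2c - 2}$, which is summable exactly when $c < 1/2$.

```python
import math

def beta_sequence(c, n):
    """beta_t = prod_{k=1}^t (1 - c/k) for t = 1..n, assuming 0 < c < 1."""
    betas, prod = [], 1.0
    for k in range(1, n + 1):
        prod *= 1.0 - c / k
        betas.append(prod)
    return betas

def noise_energy(c, n):
    """Partial sum of (gamma_j / beta_j)^2 for j = 1..n with gamma_j = c/j."""
    betas = beta_sequence(c, n)
    return sum((c / j / betas[j - 1]) ** 2 for j in range(1, n + 1))

def relative_growth(c, n):
    """Relative increase of the partial sum between n and 2n terms."""
    s_n, s_2n = noise_energy(c, n), noise_energy(c, 2 * n)
    return (s_2n - s_n) / s_n

# Limit constant C = lim beta_n * n^c, compared with 1 / Gamma(1 - c).
c_slow, n_big = 0.25, 4000
limit_estimate = beta_sequence(c_slow, n_big)[-1] * n_big ** c_slow
limit_exact = 1.0 / math.gamma(1.0 - c_slow)
```

Calling `relative_growth` with a value of `c` below and above $1/2$ shows a partial sum that has essentially stopped growing in the first case and one that keeps growing in the second.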

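The next two theorems present some probabilistic upper bounds which characterize the convergence
rates in $\mathscr H_K$ and $\mathscr L^2_{\rho_{\mathcal X}}$, under certain regularity assumptions on the regression function $f_\rho$.

Let $t_0 > 0$ and, for all $t \in \mathbb N$,

$\bar t := t + t_0,$

where $t_0$ is large enough, which won't affect the speed of convergence. We assume, in the statement
of Theorems B and C, that, for all $t \in \mathbb N$,

$\gamma_t = a \Big( \frac{1}{\bar t} \Big)^{\frac{2r}{2r+1}}, \qquad \lambda_t = \frac{1}{a} \Big( \frac{1}{\bar t} \Big)^{\frac{1}{2r+1}}.$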
Theorem B (Upper Bounds for $\mathscr H_K$-convergence). Assume $L_K^{-r} f_\rho \in \mathscr L^2_{\rho_{\mathcal X}}$, for some $r \in (1/2, 3/2]$,
a > 1, and > qk? + 1. Then, for all t G N, with probability at least 1 — 6, 



ll/t - /pik < f + (ciaV2- log I + C^a) (0 



where 

Co '■= '^to"^'^ Mp, Ci := — — — — -i^WLj/' fpWp, C2 ^ P 



^ 20r-2 ur-r^u ^ _20{k + If Mp 

(2r - l){2r + 3y 



Its proof is given in Section 5. 
Remark 2.3. Given 6 > 0, Mp and ||-Z^x'^/p||p, one can optimize a in order to minimize 

h{a) :=C7iai/2"Mog| + C72a. 



This yields the choice a* := [Ci/((r - 1/2)C2)]^''~^^^^^'\ with 

h{a*) = (r + 1/2) 



(r - 1/2) 

This asymptotic rate in Mp^*^ ||^-r-y^||2/(2r+i)^_(r-i/2)/(4r+2) same as the best known 

rates in batch learning algorithms; see [Theorem 2, Smale and Zhou 2005]. 
Remark 2.4. Note that the upper bound consists of three parts. The first term, at a rate $O(t^{-1})$,
captures the influence of the initial choice $f_0 = 0$; it does not depend on $r$ and is faster than
the remaining terms. The second term, at a rate $O(\|L_K^{-r} f_\rho\|_\rho \, t^{-(2r-1)/(4r+2)})$, collects contributions
from both the drifts along the regularization path $f_{\lambda_t} - f_{\lambda_{t-1}}$ and the approximation error $f_{\lambda_t} - f_\rho$, since
they share the same rates up to different constants. The third term, at a rate $O(t^{-(2r-1)/(4r+2)})$,
reflects the error caused by random fluctuations of the i.i.d. sampling. As we will see later, the
second term is a bound on the bias and the third term is a bound on the variance.

Theorem C (Upper Bounds for ^p^^. -convergence). Assume that fp G fo'r some r £ 

[1/2, 1]. Assume a > 4, and > 2 + Sk^u. 

Then, for all t G N, with probability at least 1 — 5 (5 £ {0, 1) ), 

r 4r— 1 

\\ft-fp\\p < ^+(^D,a-^ + V^D2logfj (^^y''^' + [a^/^D,y^t + a^/^D,) {\og{2/5)f (^^Y^'' 

^1 •= 4t^II^//pIIp' D2:=WkMp, Ds = 63K^Mp, := 50k^ Mpt]/^"' . 
r(l + r) 

Its proof will be given in Section 6. 

Remark 2.5. When $r \in (1/2, 1]$, the first term of $O(1/t)$ and the third term of $O(t^{-\frac{2r - 1/2}{2r+1}} \log^{1/2} t)$
both drop faster than the second term of $O(t^{-\frac{r}{2r+1}})$, whence they can be ignored asymptotically.
The second term, as the dominant one, roughly speaking has contributions from two parts: the one
with constant $D_1$ comes from the bias, i.e. the approximation and the drift errors, while the other
with constant $D_2$ comes from the variance, i.e. the sample error.

Remark 2.6. A special case is $r = 1/2$, which is equivalent to saying $f_\rho \in \mathscr H_K$. In this case $\gamma_t / a = a \lambda_t =
\bar t^{-1/2}$, whence it does not satisfy the Path Following Condition (B) in Theorem A. But Theorem C
suggests a weaker notion that $f_t$ follows the regularization path, i.e. $f_t \to f_\rho$ in $\mathscr L^2_{\rho_{\mathcal X}}$ rather than
$\mathscr H_K$, which in fact converges at a rate of $O(t^{-1/4} \log^{1/2} t)$ uniformly for all $f_\rho \in \mathscr H_K$.

Remark 2.7. In all, the convergence in $\mathscr L^2_{\rho_{\mathcal X}}$ has rates $O(t^{-r/(2r+1)} \log^{1/2} t \cdot \log^2(1/\delta))$, a logarithmic
polynomial in $\delta$, whence the Borel-Cantelli Lemma implies almost sure convergence $\|f_t - f_\rho\|_\rho \to
0$. This result solves an open problem raised earlier in [Yao 2006].

Remark 2.8. To see the asymptotic optimality, consider the generalization error $\mathscr E(f) - \mathscr E(f_\rho) =
\|f - f_\rho\|_\rho^2$ [e.g. see Cucker and Smale 2002]. Since the rate $O(t^{-r/(2r+1)})$ dominates when $r > 1/2$,
then under the same condition of Theorem C, there holds with probability at least $1 - \delta$ ($\delta \in (0, 1)$),
for all $t \in \mathbb N$,

$\mathscr E(f_t) - \mathscr E(f_\rho) \le O(t^{-2r/(2r+1)}).$

For $r \in (1/2, 1]$, the asymptotic rate $O(t^{-2r/(2r+1)})$ has been shown to be optimal in the sense that
it reaches the minimax and individual lower rate [Caponnetto and De Vito 2005]. To be precise,
let $\mathcal P(b, r)$ ($b > 1$ and $r \in (1/2, 1]$) be the set of probability measures $\rho$ on $\mathcal X \times \mathcal Y$ such that: (A)
almost surely $|y| \le M_\rho$; (B) $L_K^{-r} f_\rho \in \mathscr L^2_{\rho_{\mathcal X}}$; (C) the eigenvalues $(\mu_n)_{n\in\mathbb N}$ of $L_K : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr L^2_{\rho_{\mathcal X}}$,
arranged in a nonincreasing order, are subject to the decay $\mu_n = O(n^{-b})$. Then the following
minimax lower rate was given as Theorem 2 in [Caponnetto and De Vito 2005],

$\liminf_{t\to\infty} \inf_{(z_i)_1^t \mapsto f_t} \sup_{\rho\in\mathcal P(b,r)} \mathrm{Prob}\Big\{ (z_i)_1^t \in \mathcal Z^t : \mathscr E(f_t) - \mathscr E(f_\rho) > C t^{-\frac{2rb}{2rb+1}} \Big\} = 1$
for some constant $C > 0$ independent of $t$, where the infimum in the middle is taken over all
algorithms as a map $\mathcal Z^t \ni (z_i)_1^t \mapsto f_t \in \mathscr H_K$.

Note that in the minimax lower rate, the probability measure may change for different data
size $t$, which violates the fundamental identical distribution assumption in learning. Therefore
[Gyorfi, Kohler, Krzyzak, and Walk 2002] suggests a kind of individual lower rates for learning
problems. The following individual lower rate was obtained as Theorem 3 in [Caponnetto and De Vito 2005]:
for every $B > b$,

mf sup limsup ^ — > 0, 

where the infimum is taken over arbitrary sequences of functions $f_t : \mathcal Z^t \to \mathscr H_K$. It can be seen that
the key difference in the individual lower rate lies in the order of $\limsup_{t\to\infty}$ and $\sup_{\rho\in\mathcal P(b,r)}$:
the same probability measure $\rho$ is applied to all sufficiently large $t$.

Now we compare these lower rates to our upper bound. Since $L_K : \mathscr L^2_{\rho_{\mathcal X}} \to \mathscr L^2_{\rho_{\mathcal X}}$ is a trace-class
operator, its eigenvalues are summable. Therefore by taking $b = B = 1$, one may obtain an
eigenvalue-independent lower rate $O(t^{-2r/(2r+1)})$ for all possible $L_K$. Therefore, the upper bound
by Theorem C reaches both the minimax and the individual lower rates.



3. Sequential Stochastic Approximations of Regularization Paths in Hilbert Spaces

In this section, we study some stochastic approximation sequences in the more general setting
of general Hilbert spaces.

Let $W$ be a Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|u\| := \sqrt{\langle u, u \rangle}$, and
let $SL(W)$ be the vector space of self-adjoint bounded linear operators on $W$, endowed with the
canonical norm

$\|A\| := \sup_{\|x\| \le 1} \|Ax\|.$

Let $\mathcal X$ and $\mathcal Y$ be two topological spaces (on which we make no other assumption), let $\mathcal Z :=
\mathcal X \times \mathcal Y$ and let $\rho$ be a probability measure on the Borel $\sigma$-algebra of $\mathcal Z$. Let $A : \mathcal Z \to SL(W)$ and
$b : \mathcal Z \to W$ be random variables on the sample space $\mathcal Z$ taking values respectively in $SL(W)$ and
$W$, and let

$\bar A := \mathbb E[A], \quad \bar b := \mathbb E[b]$

be their expectations on $(\mathcal Z, \rho)$.

Now assume that $\bar A$ is a positive operator, hence invertible, but that it has an unbounded inverse.
Knowing $A$ and $b$, but not $\rho$ (and subsequently not $\bar A$ and $\bar b$), and assuming $\bar b \in \bar A(W)$, the aim is
to devise a stochastic algorithm approximating the solution $\bar w$ of the following linear equation

(7) $\bar A \bar w = \bar b,$

using as data an i.i.d. sequence $(z_t)_{t\in\mathbb N}$ in $\mathcal Z$ with probability law $\rho$. As in the standard setting
of Robbins-Monro (see [Robbins and Monro 1951], [Kiefer and Wolfowitz 1952]), it is natural to
consider a stochastic gradient descent algorithm.
More precisely, the search for the solution $\bar w$ of (7) is equivalent to the minimization of the
quadratic potential map $V : W \to \mathbb R$,

$V(w) := \frac{1}{2} \langle \bar A (w - \bar w), w - \bar w \rangle,$

whose gradient $\mathrm{grad}\, V : W \to W$ is given by

$\mathrm{grad}\, V(w) = \bar A w - \bar b = \mathbb E[A w - b].$

In the context of online learning presented in the first two sections, $W := \mathscr H_K$, $A((x, y)) f :=
f(x) K_x$, $b((x, y)) := y K_x$ (see Section 3.3), so that $\bar A = L_K$, $\bar b = L_K f_\rho$ and $\bar w = f_\rho$, and $V(f) =
\frac{1}{2} \|f - f_\rho\|_\rho^2 = \frac{1}{2} (\mathscr E(f) - \mathscr E(f_\rho))$ is, up to the factor $1/2$, the generalization error.

A natural Robbins-Monro gradient descent algorithm would be

(8) $w_t = w_{t-1} - \gamma_t (A(z_t) w_{t-1} - b(z_t)),$

since $\mathbb E_{z_t\sim\rho}[A(z_t) w_{t-1} - b(z_t)] = \bar A w_{t-1} - \bar b$.

However, the sample complexity analysis on Hilbert spaces, in order to estimate the sample size
sufficient to approximate the minimizer with high probability, requires boundedness of $\bar A^{-1}$ (see for
instance [Smale and Yao 2006]).

To solve this ill-posed problem with unbounded $\bar A^{-1}$, one may construct sequences of random
variables $(A_t)_{t\in\mathbb N}$ and $(b_t)_{t\in\mathbb N}$ on the sample space $\mathcal Z$ taking values respectively in $SL(W)$ and $W$,
with the assumption that, if

$\bar A_t := \mathbb E[A_t], \quad \bar b_t := \mathbb E[b_t]$

are their expectations on $(\mathcal Z, \rho)$, $\bar A_t$ has bounded inverse and $\bar A_t \to \bar A$, $\bar b_t \to \bar b$. Then the aim is to
find assumptions ensuring that the stochastic approximation sequence $(w_t)_{t\in\mathbb N}$, iteratively defined
by a deterministic initial value $w_0$ and

(9) $w_t = w_{t-1} - \gamma_t (A_t(z_t) w_{t-1} - b_t(z_t)),$

where $(\gamma_t)_{t\in\mathbb N}$ is a real positive sequence, converges to the solution $\bar w$ of (7) as $t$ goes to infinity.
This question can be divided into two subquestions: first the deterministic convergence of

(10) $\bar w_t := \bar A_t^{-1} \bar b_t$

to $\bar w$, the path $t \mapsto \bar w_t$ being then called a regularization path of the solution of equation (7), and
second the probabilistic convergence of the quantity

(11) $r_t := w_t - \bar w_t,$

which we call the remainder (note that $\bar w_t = \bar A_t^{-1} \bar b_t \ne \mathbb E[w_t]$ in general). In the online learning case
(see Section 3.3), we choose $A_t := A + \lambda_t I$, $(\lambda_t)_{t\in\mathbb N}$ a positive sequence, and $b_t := b$, so that $\bar w_t = f_{\lambda_t} \to f_\rho$
in $\mathscr H_K$.

We provide in Section 3.1 two structural decompositions of $r_t$, respectively a reversed martingale
and a martingale one. Both expand $r_t$ into three parts: one depending on the initial value of $r$,
called the initial error, one depending on the drift

(12) $\Delta_j := \bar w_j - \bar w_{j-1}$

along the regularization path $(\bar w_t)$, called the drift error, and finally one random variable of zero
mean called the sample error, respectively written as a reversed martingale and as a martingale at
time $t$.

The reversed martingale decomposition will, on one hand, enable us to prove Theorem 3.5 below, 
whose corollary is Theorem A in the context of online learning, and which provides sufficient 
assumptions on the asymptotic behaviour of the norms of At, A~[^ , A~[^ and bt for the convergence 
of the variance of the remainder rt. On the other hand, this reversed decomposition will yield 
Theorem B giving upper bounds on $f_t - f_\rho$ in $\mathscr H_K$ with high probability, proved in Section 5.

The martingale decomposition will imply Theorem C giving upper bounds of $f_t - f_\rho$ in $\mathscr L^2_{\rho_{\mathcal X}}$
with high probability, proved in Section 6.

3.1. Two Structural Decomposition Theorems. For all $j, t \in \mathbb N$, let $\Pi_j^t$ be the random operator
on $W$, on the sample space $\mathcal Z^{\mathbb N}$, defined by

$\Pi_j^t := \prod_{i=j}^t (I - \gamma_i A_i(z_i))$ if $j \le t$; $\quad \Pi_j^t := I$ otherwise.

By a slight abuse of notation, we let $A_t := A_t(z_t)$ and $b_t := b_t(z_t)$ in the sequel, when there is no
ambiguity.

Theorem 3.1 (Reversed Martingale Decomposition). For all $s, t \in \mathbb N$, $t \ge s$,

(13) $r_t = \Pi_{s+1}^t r_s - \sum_{j=s+1}^t \gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j) - \sum_{j=s+1}^t \Pi_j^t \Delta_j.$

Remark 3.2. Note that $\Pi_{j+1}^t$ is an operator whose randomness only depends on $z_{j+1}, \ldots, z_t$, whereas
the randomness in $A_j \bar w_j - b_j$, with zero mean, only depends on $z_j$. By independence of $z_t$, $t \in \mathbb N$, the
conditional expectation $\mathbb E[\gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j) \mid z_{j+1}, \ldots, z_t]$ is $0$, whence for each $t$, $\gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j)$
is a reversed martingale difference sequence whose sum is a reversed martingale sequence with zero
mean. For more background on reversed martingales, see for example [Neveu 1975].

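Proof of Theorem 3.1. By definition,

$r_t = w_t - \bar w_t$

$= w_{t-1} - \bar w_t - \gamma_t (A_t w_{t-1} - b_t)$

$= (I - \gamma_t A_t)(w_{t-1} - \bar w_{t-1}) - (I - \gamma_t A_t)(\bar w_t - \bar w_{t-1}) + \gamma_t A_t (w_{t-1} - \bar w_t) - \gamma_t (A_t w_{t-1} - b_t)$

$= (I - \gamma_t A_t)(w_{t-1} - \bar w_{t-1}) - (I - \gamma_t A_t)(\bar w_t - \bar w_{t-1}) - \gamma_t (A_t \bar w_t - b_t),$

which implies

(14) $r_t = (I - \gamma_t A_t) r_{t-1} - \gamma_t (A_t \bar w_t - b_t) - (I - \gamma_t A_t) \Delta_t.$

The result follows by induction on $t \in \mathbb N$, $t \ge s$. □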
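Identity (13) is purely algebraic, so it can be checked numerically. The sketch below does this in the simplest possible setting, the scalar case $W = \mathbb R$, where operators are numbers and commute; the random draws, the step sizes and the path `w_bar` are arbitrary illustrative choices, and the function name is hypothetical. It runs recursion (9) and compares $r_T$ with the right-hand side of (13).

```python
import random

def check_reversed_decomposition(T=25, s=3, seed=0):
    """Compare r_T with the right-hand side of (13) in the scalar case."""
    rng = random.Random(seed)
    # Lists are padded at index 0 so that indices match the 1-based text.
    gamma = [0.0] + [0.5 / j ** 0.6 for j in range(1, T + 1)]
    A = [0.0] + [rng.uniform(0.2, 1.8) for _ in range(T)]    # A_j(z_j)
    b = [0.0] + [rng.uniform(-1.0, 1.0) for _ in range(T)]   # b_j(z_j)
    w_bar = [1.0 / (1.0 + j) ** 0.5 for j in range(T + 1)]   # arbitrary path
    w = [2.0]                                                # w_0
    for t in range(1, T + 1):
        w.append(w[t - 1] - gamma[t] * (A[t] * w[t - 1] - b[t]))

    def Pi(j, t):
        # prod_{i=j}^t (1 - gamma_i A_i); empty product (j > t) equals 1.
        prod = 1.0
        for i in range(j, t + 1):
            prod *= 1.0 - gamma[i] * A[i]
        return prod

    r = [w[t] - w_bar[t] for t in range(T + 1)]
    initial = Pi(s + 1, T) * r[s]
    sample = sum(gamma[j] * Pi(j + 1, T) * (A[j] * w_bar[j] - b[j])
                 for j in range(s + 1, T + 1))
    drift = sum(Pi(j, T) * (w_bar[j] - w_bar[j - 1])
                for j in range(s + 1, T + 1))
    return r[T], initial - sample - drift

lhs_reversed, rhs_reversed = check_reversed_decomposition()
```

The check passes for any path `w_bar`, since the derivation of (14) never uses the relation between $\bar w_t$ and the expectations $\bar A_t$, $\bar b_t$.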



For all $j, t \in \mathbb N$, let

$\chi_t = (\bar A_t - A_t) w_{t-1} + (b_t - \bar b_t),$


and let $\bar\Pi_j^t$ be the deterministic operator on $W$ defined by

$\bar\Pi_j^t := \prod_{i=j}^t (I - \gamma_i \bar A_i)$ if $j \le t$; $\quad \bar\Pi_j^t := I$ otherwise.



Theorem 3.3 (Martingale Decomposition). For all $s, t \in \mathbb N$, $t \ge s$,

(15) $r_t = \bar\Pi_{s+1}^t r_s + \sum_{j=s+1}^t \gamma_j \bar\Pi_{j+1}^t \chi_j - \sum_{j=s+1}^t \bar\Pi_j^t \Delta_j.$



Remark 3.4. The martingale decomposition was proposed in [Yao 2010]. Contrary to the reversed
martingale decomposition, only the sample error is random here, the operator $\bar\Pi_{j+1}^t$ being
deterministic. The process $(\gamma_j \bar\Pi_{j+1}^t \chi_j)_{j \in \mathbb{N}}$ is a martingale difference sequence since, for all $j \in \mathbb{N}$ and
$t \ge j$, $\mathbb{E}_{j-1}[\gamma_j \bar\Pi_{j+1}^t \chi_j] = 0$. Note that the martingale property continues to hold for dependent
sampling $z_t(z_1, \dots, z_{t-1})$, as long as $\mathbb{E}_{t-1}[A_t(z_t)] = \bar A_t$ and $\mathbb{E}_{t-1}[b_t(z_t)] = \bar b_t$.

The non-randomness of the operator $\bar\Pi_j^t$ will play a key role in the proof of Theorem C in the
online learning context, since it will enable us to make explicit calculations involving the spectral
decomposition of $L_K : \mathscr{L}^2_{\rho_X} \to \mathscr{L}^2_{\rho_X}$ (recall that $\bar A_t = L_K + \lambda_t I$ then). However, the fact that $\chi_t$,
contrary to $A_t \bar w_t - b_t$ in the reversed expansion, does not depend only on $z_t$ but rather on the whole
past $(z_i)_{0 < i \le t}$, makes it necessary to obtain a preliminary upper bound of $\chi_t$ in Appendix B, which
explains the factor $(\log 2/\delta)^2$ in Theorem C, rather than $\log 2/\delta$ in Theorem B.

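Proof of Theorem 3.3. By definition,

$r_t = w_t - \bar w_t$

$\quad = w_{t-1} - \bar w_t - \gamma_t (A_t w_{t-1} - b_t)$

$\quad = (I - \gamma_t \bar A_t)(w_{t-1} - \bar w_t) + \gamma_t \chi_t$, using $\bar b_t = \bar A_t \bar w_t$

$\quad = (I - \gamma_t \bar A_t) r_{t-1} + \gamma_t \chi_t - (I - \gamma_t \bar A_t)[\bar w_t - \bar w_{t-1}].$

The result follows by induction on $t \in \mathbb{N}$, $t \ge s$. □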
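Since (13) and (15) are purely algebraic identities, they can be checked numerically along a simulated trajectory. The following Python sketch does so in a hypothetical scalar model ($W = \mathbb{R}$, $A_t(z_t) = x_t^2 + \lambda_t$, $b_t(z_t) = y_t x_t$); the model, the gain and regularization sequences and all function names are assumptions made for illustration only and are not taken from the paper.

```python
import random


def simulate(T=60, seed=1):
    """Run w_t = w_{t-1} - gamma_t (A_t w_{t-1} - b_t) in a toy scalar model."""
    rng = random.Random(seed)
    # Illustrative (assumed) gain and regularization sequences, indexed from 0.
    gam = [0.5 / (t + 1) ** 0.5 for t in range(T + 1)]
    lam = [1.0 / (t + 1) ** 0.5 for t in range(T + 1)]
    # For x ~ U(-1, 1) and y = 2x + noise: E[x^2] = 1/3 and E[y x] = 2/3.
    A_bar = [1.0 / 3.0 + lam[t] for t in range(T + 1)]
    b_bar = 2.0 / 3.0
    w_bar = [b_bar / A_bar[t] for t in range(T + 1)]  # so that b_bar = A_bar_t * w_bar_t
    A, b, w = [0.0], [0.0], [0.0]  # index 0 holds placeholders and w_0 = 0
    for t in range(1, T + 1):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.1)
        A.append(x * x + lam[t])
        b.append(y * x)
        w.append(w[t - 1] - gam[t] * (A[t] * w[t - 1] - b[t]))
    return gam, A, b, A_bar, b_bar, w, w_bar


def op_product(j, t, gam, ops):
    """Scalar analogue of Pi_j^t: product of (1 - gamma_i * ops_i) for i = j..t (1 if j > t)."""
    p = 1.0
    for i in range(j, t + 1):
        p *= 1.0 - gam[i] * ops[i]
    return p


def decompositions(s, t, data):
    """Return r_t together with the right-hand sides of (13) and (15)."""
    gam, A, b, A_bar, b_bar, w, w_bar = data
    js = range(s + 1, t + 1)
    r = [wk - wbk for wk, wbk in zip(w, w_bar)]
    delta = {j: w_bar[j] - w_bar[j - 1] for j in js}
    chi = {j: (A_bar[j] - A[j]) * w[j - 1] + (b[j] - b_bar) for j in js}
    reversed_rhs = (
        op_product(s + 1, t, gam, A) * r[s]
        - sum(gam[j] * op_product(j + 1, t, gam, A) * (A[j] * w_bar[j] - b[j]) for j in js)
        - sum(op_product(j, t, gam, A) * delta[j] for j in js)
    )
    martingale_rhs = (
        op_product(s + 1, t, gam, A_bar) * r[s]
        + sum(gam[j] * op_product(j + 1, t, gam, A_bar) * chi[j] for j in js)
        - sum(op_product(j, t, gam, A_bar) * delta[j] for j in js)
    )
    return r[t], reversed_rhs, martingale_rhs
```

Calling `decompositions(s, t, simulate())` returns three numbers that should agree up to rounding; the only structural difference between the two right-hand sides is whether the products use the random $A_i$ or their means $\bar A_i$.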

3.2. Sufficient Conditions for the Convergence of the Remainder. The following Theorem
3.5, which implies Theorem A in the context of online learning (see Section 3.3), states the
convergence of $\|r_t\|^2 = \|w_t - \bar w_t\|^2$ to zero in expectation, under some assumptions on the asymptotic
behaviour of the gain sequence $\gamma_t$ and of the norms of $b_t$ and operators $A_t$, $A_t^{-1}$ and $\bar A_t^{-1}$.

The corresponding Generalized Finiteness Condition on the asymptotic behaviour of $A_t$ and $b_t$
is a generalization of the Finiteness Condition in [Smale and Yao 2006].

Generalized Finiteness Condition. Let $(\alpha_t)_{t \in \mathbb{N}}$ and $(\underline{\alpha}_t)_{t \in \mathbb{N}}$ be deterministic positive sequences.
For all $t \in \mathbb{N}$, assume that almost surely $A_t$ is positive, and the operators $A_t$, $\bar A_t$ and $\bar A$ are invertible
(although $\bar A$ has an unbounded inverse), and that

$\|A_t\| \le \alpha_t, \qquad \|A_t^{-1}\| \le \underline{\alpha}_t^{-1}.$

Theorem 3.5. Consider the stochastic approximation sequence $(w_t)_{t \in \mathbb{N}_0}$ and remainder $(r_t)_{t \in \mathbb{N}_0}$
defined in (9)-(11).
Suppose that the Generalized Finiteness Condition holds, and that the variance $\mathbb{E}\|A_t \bar w_t - b_t\|^2$ is
uniformly bounded in $t \in \mathbb{N}$. Then



if the following assumptions hold: 
(A) 7t and ^ ^tOt = oo, 



t 

n n 



(B) limsup ^^7^ n (1 - = 0' 



k=i i=k+i 



fC; limsupJ^IIAfcll W (l-7iai) = 0. 



k=i i=k+i 

The following Lemma 3.6 enables us to provide simple sufficient conditions for (B) and (C) in
Corollary 3.7.

Lemma 3.6. Let $(a_t)_{t \in \mathbb{N}}$ and $(b_t)_{t \in \mathbb{N}}$ be two real positive sequences converging to 0 when $t$ goes to
infinity. Then

$\limsup_{t \to \infty} a_t / b_t = 0$ and $\sum_{t \in \mathbb{N}} b_t = \infty \implies \limsup_{n \to \infty} \sum_{k=1}^{n} a_k \prod_{i=k+1}^{n} (1 - b_i) = 0.$
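The conclusion of Lemma 3.6 can be illustrated numerically. The sketch below evaluates $S_n = \sum_{k=1}^{n} a_k \prod_{i=k+1}^{n} (1 - b_i)$ through the recursion $S_n = (1 - b_n) S_{n-1} + a_n$, for the assumed example sequences $a_t = 1/t$ and $b_t = t^{-1/2}$ (so that $a_t / b_t \to 0$ and $\sum_t b_t = \infty$). The sequences and function names are illustrative choices, not taken from the paper.

```python
def weighted_sum(n, a, b):
    """S_n = sum_{k=1}^n a(k) * prod_{i=k+1}^n (1 - b(i)), via S_n = (1 - b(n)) S_{n-1} + a(n)."""
    s = 0.0
    for k in range(1, n + 1):
        s = (1.0 - b(k)) * s + a(k)
    return s


def a_seq(k):
    # Assumed example: a_t = 1/t.
    return 1.0 / k


def b_seq(k):
    # Assumed example: b_t = t^(-1/2), which is not summable.
    return k ** -0.5


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, weighted_sum(n, a_seq, b_seq))
```

For this particular pair of sequences, a short induction shows $n^{-1/2} \le S_n \le 2 n^{-1/2}$, so the printed values decrease towards zero at the rate of $a_n / b_n$.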

Corollary 3.7. In the statement of Theorem 3.5, assumptions (B) and (C) may respectively be
replaced by

(B') $\limsup_{t \to \infty} \gamma_t / \underline{\alpha}_t = 0$,

(C') $\limsup_{t \to \infty} \|\Delta_t\| / (\underline{\alpha}_t \gamma_t) = 0$.

Theorem 3.5 and Lemma 3.6 are proved in Appendix C, and imply Corollary 3.7: Lemma 3.6
with $a_t := \gamma_t^2$ (resp. $a_t := \|\Delta_t\|$) and $b_t := \underline{\alpha}_t \gamma_t$ shows that (B') (resp. (C')) implies (B) (resp.
(C)).

The proof of Theorem 3.5 makes use of the following preliminary Lemma 3.8 (shown in Appendix
C), which implies some upper bounds of the norms of operators $\Pi_j^t$, $t \ge j$, also used in Sections 5
and 6.

Lemma 3.8. Let $j_0 \in \mathbb{N}$, let $(\gamma_t)_{t \in \mathbb{N}}$, $(\alpha_t)_{t \in \mathbb{N}}$ and $(\underline{\alpha}_t)_{t \in \mathbb{N}}$ be real positive sequences, and let
$(A_t)_{t \in \mathbb{N}}$ be a sequence of positive compact self-adjoint operators on the Hilbert space $W$. Assume
that, for all $t \ge j_0$, $\|A_t\| \le \alpha_t$, $\|A_t^{-1}\|^{-1} \ge \underline{\alpha}_t$ and $\gamma_t \alpha_t \le 1$.

Then, for all $t \ge j_0$ and $j_0 \le j \le t$,

(A) $\|I - \gamma_t A_t\| \le 1 - \gamma_t \underline{\alpha}_t$;

(B) $\|\Pi_j^t\| \le \prod_{i=j}^{t} (1 - \gamma_i \underline{\alpha}_i)$.

In particular, if the two sequences $(\gamma_t)_{t \in \mathbb{N}}$ and $(\underline{\alpha}_t)_{t \in \mathbb{N}}$ are such that, for all $t \ge j_0$, $\gamma_t \underline{\alpha}_t = c/(t + t_0)$
for some $c, t_0 > 0$, then (B) yields

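3.3. Application to online learning and Proof of Theorem A. The online learning sequence
$(f_t)_{t \in \mathbb{N}_0}$ defined in (6), with assumptions (I)-(II), can be interpreted as a sequential stochastic
approximation algorithm $(w_t)_{t \in \mathbb{N}_0}$ in (8), taking values in the Hilbert space $W := \mathscr{H}_K$, with

$A((x, y)) := \langle \cdot, K_x \rangle_K K_x, \qquad b((x, y)) := y K_x,$

$A_t := A + \lambda_t I, \qquad b_t := b,$

so that

$\bar A = L_K, \qquad \bar b = L_K f_\rho, \qquad \bar w = f_\rho,$

$\bar A_t = L_K + \lambda_t I, \qquad \bar w_t = f_{\lambda_t},$

$r_t = f_t - f_{\lambda_t}, \qquad \Delta_t = f_{\lambda_t} - f_{\lambda_{t-1}}.$

Let us emphasize that the operator $A$ is only defined from $\mathscr{H}_K$ to $\mathscr{H}_K$ here (we would not be able
to define $f(x)$ for $f \in \mathscr{L}^2_{\rho_X}$). The properties mentioned below will only hold on $\mathscr{H}_K$ in general, and
in particular the norms of operators $\|\cdot\|$ are assumed to be $\|\cdot\|_{\mathscr{H}_K \to \mathscr{H}_K}$, although operators defined
on $\mathscr{L}^2_{\rho_X}$ and commuting with $L_K^{1/2}$ (which is an isometry between $\mathscr{L}^2_{\rho_X}$ and $\mathscr{H}_K$) have the same
norm in either space.

Note that $A(z)$ is positive for all $z = (x, y) \in \mathcal{Z}$ (which implies $\bar A = L_K$ positive as well), since

$\langle A((x, y))(f), f \rangle_K = \langle f(x) K_x, f \rangle_K = f(x)^2 \ge 0$

for all $f \in \mathscr{H}_K$.

Also, for all $f \in \mathscr{H}_K$, $\|A(z) f\|_K = |\langle K_x, f \rangle_K| \, \|K_x\|_K \le \kappa^2 \|f\|_K$, so that

$\|A(z)\| \le \kappa^2, \qquad \|\bar A\| \le \mathbb{E}(\|A\|) \le \kappa^2.$

Hence

(16) $\|A_t\| \le \alpha_t := \lambda_t + \kappa^2, \qquad \|A_t^{-1}\|^{-1} \ge \underline{\alpha}_t := \lambda_t.$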

Let us prove Theorem A; assume its conditions hold. Then the Generalized Finiteness Condition
of Section 3.2 is satisfied. Now $f_\rho \in \mathscr{H}_K$ implies $\|f_\lambda - f_\rho\|_K \to 0$ when $\lambda \to 0$. Therefore the
conclusion follows from the convergence of $\mathbb{E}[\|w_t - \bar w_t\|^2] = \mathbb{E}[\|f_t - f_{\lambda_t}\|_K^2]$ to 0 in Theorem 3.5, the
condition of uniform boundedness of $\mathbb{E}\|A_t \bar w_t - b_t\|^2$ being shown in Lemma 5.5 (B).

For convenience, we will use, in Sections 5 and 6, the notation

$L_t := A(z_t) = \langle \cdot, K_{x_t} \rangle_K K_{x_t}.$

We will assume that

(17) $\gamma_t = \frac{a}{(t + t_0)^{\theta}}, \qquad \lambda_t = \frac{b}{(t + t_0)^{1 - \theta}}, \qquad$ for some $\theta \in [0, 1]$, $t_0 > 0$, $a \in (0, t_0^{\theta})$, $b \in (0, t_0^{1 - \theta})$,

and then study the $\mathscr{H}_K$- or $\mathscr{L}^2_{\rho_X}$-norm of the error $f_t - f_\rho$, using a reversed martingale (resp.
martingale) decomposition in Section 5 (resp. Section 6). We will then optimize the upper bounds
in $\theta$, $a$ and $b$ by using some prior information on the regularity of $f_\rho$.
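With $A_t = L_t + \lambda_t I$ and $b_t = y_t K_{x_t}$, one step of the recursion $w_t = w_{t-1} - \gamma_t (A_t w_{t-1} - b_t)$ reads $f_t = (1 - \gamma_t \lambda_t) f_{t-1} - \gamma_t (f_{t-1}(x_t) - y_t) K_{x_t}$. The following Python sketch implements this update with step sizes of the form (17). The Gaussian kernel (for which $\kappa = 1$), the default parameter values and the function names are assumptions chosen for illustration; they are not prescribed by the paper.

```python
import math
import random


def gaussian_kernel(u, v, width=0.5):
    """Assumed example kernel on the real line; K(x, x) = 1, hence kappa = 1."""
    return math.exp(-((u - v) ** 2) / (2.0 * width ** 2))


def online_regularized_learning(samples, a=1.0, b=1.0, theta=2.0 / 3.0, t0=8.0,
                                kernel=gaussian_kernel):
    """Return f_t after one pass over `samples`, a list of (x, y) pairs.

    The iterate is stored as f = sum_i coeffs[i] * K(centers[i], .), starting from f_0 = 0.
    """
    centers, coeffs = [], []
    for t, (x, y) in enumerate(samples, start=1):
        gamma = a / (t + t0) ** theta            # gain, as in (17)
        lam = b / (t + t0) ** (1.0 - theta)      # regularization, as in (17)
        prediction = sum(c * kernel(z, x) for c, z in zip(coeffs, centers))
        # f_t = (1 - gamma*lam) f_{t-1} - gamma (f_{t-1}(x_t) - y_t) K_{x_t}
        coeffs = [(1.0 - gamma * lam) * c for c in coeffs]
        centers.append(x)
        coeffs.append(-gamma * (prediction - y))

    def f(u):
        return sum(c * kernel(z, u) for c, z in zip(coeffs, centers))

    return f
```

The default values satisfy $t_0^{\theta} = 4 \ge a(\kappa^2 + b) = 2$, the condition used later to keep $\gamma_t \alpha_t \le 1$. Storing one coefficient per sample makes each step cost $O(t)$ kernel evaluations, which is acceptable for a sketch but not for long runs.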

Finally observe that Lemma 3.8 implies, using (16), that for all j , t £ t > j, 

(18) 11/ - <i-jAt=i-f, mw < {^^f 

if tQ > a(K? + h) (and, therefore, ^toit = 7tAf + 7j«;^ < aht^^ + aK^t^^ < 1 
Similarly, for all j, t G N, t > j. 



(19) l|/-7Mtll < 1-— > l|n-|| < 



t ' " - \t + i 



ab 



if tf) — ^('^^ + norm in (19) can be as well as ||.||_^2 _^^2 . 
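Product bounds of this kind rest on an elementary estimate: when $\gamma_i \lambda_i = c/(i + t_0)$, the inequality $1 - x \le e^{-x}$ and a comparison of $\sum_{i=j}^{t} (i + t_0)^{-1}$ with $\int_{j + t_0}^{t + t_0 + 1} dx/x$ give $\prod_{i=j}^{t} (1 - c/(i + t_0)) \le ((j + t_0)/(t + t_0 + 1))^{c}$. The sketch below checks this numerically. The precise form of the right-hand side is the one produced by this standard argument and is stated here as an assumption, since it may differ in detail from the constants used in the paper; the function name is illustrative.

```python
def product_and_bound(j, t, c, t0):
    """Return prod_{i=j}^t (1 - c/(i + t0)) and the bound ((j + t0)/(t + t0 + 1))**c."""
    if c > j + t0:
        raise ValueError("need c <= j + t0 so that every factor is nonnegative")
    prod = 1.0
    for i in range(j, t + 1):
        prod *= 1.0 - c / (i + t0)
    return prod, ((j + t0) / (t + t0 + 1.0)) ** c
```

With $c = ab$, this is the mechanism by which the polynomial decay $t^{-ab}$ of the operator products arises.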

4. Estimates of Drift on the Regularization Path 

This section is devoted to estimates on the drift $\|f_\lambda - f_\mu\|$, $\lambda, \mu > 0$, along the regularization
path $\lambda \mapsto f_\lambda$, in $\mathscr{H}_K$-norm or $\mathscr{L}^2_{\rho_X}$-norm, assuming that $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$ for some $r > 0$. These
estimates enable us to upper bound on the one hand the approximation error $\|f_\lambda - f_\rho\|$ (when
specialized to $\mu = 0$), and on the other hand the drift error in the martingale and reversed martingale
decompositions.

Note that the estimate $\|f_\lambda - f_\mu\| = O(|\lambda - \mu|)$ in the case $r = 1$ is not improved by increasing
$r$. This is related to a phenomenon usually referred to as the saturation problem in regularizations
[Engl, Hanke, and Neubauer 1996].

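Theorem 4.1. Let $\lambda > \mu \ge 0$. Assume that $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$ for some $r \ge -1$.

(A) If $r \in [-1, 1] \setminus \{0\}$, then

$\|f_\lambda - f_\mu\|_\rho \le |\lambda^r - \mu^r| \, \frac{\|L_K^{-r} f_\rho\|_\rho}{|r|};$

(B) If $r > 1$, then for any $1 \le s \le r$,

$\|f_\lambda - f_\mu\|_\rho \le \kappa^{2(s-1)} |\lambda - \mu| \, \|L_K^{-s} f_\rho\|_\rho;$

(C) If $r \ge 1/2$, then

$\|f_\lambda - f_\mu\|_K \le \frac{|\lambda - \mu|}{\lambda} \|f_\rho\|_K;$

(D) If $r \in [-1/2, 3/2] \setminus \{1/2\}$, then

$\|f_\lambda - f_\mu\|_K \le |\lambda^{r - 1/2} - \mu^{r - 1/2}| \, \frac{\|L_K^{-r} f_\rho\|_\rho}{|r - \frac{1}{2}|};$

(E) If $r > 3/2$, then for any $3/2 \le s \le r$,

$\|f_\lambda - f_\mu\|_K \le \kappa^{2(s - 3/2)} |\lambda - \mu| \, \|L_K^{-s} f_\rho\|_\rho.$

Proof. Fix $\lambda > \mu$, assume $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$ for some $r \in [-1, 1]$ and let $\|\cdot\| := \|\cdot\|_{\mathscr{L}^2_{\rho_X} \to \mathscr{L}^2_{\rho_X}}$. We first
prove that, for all $u \ge -1$, if we let

$J_{u,\lambda,\mu} := (\mu - \lambda)(L_K + \lambda I)^{-1} (L_K + \mu I)^{-1} L_K^{1+u}$

then, for all $t \in [-1, 1] \setminus \{0\}$, $u \ge t$,

(20) $\|J_{u,\lambda,\mu}\| \le \kappa^{2(u - t)} |\lambda^t - \mu^t| / |t|.$

This will be useful, since

(21) $f_\lambda - f_\mu = (\mu - \lambda)(L_K + \lambda I)^{-1} (L_K + \mu I)^{-1} L_K f_\rho = J_{r,\lambda,\mu} L_K^{-r} f_\rho,$

using that

$(L_K + \lambda I) f_\lambda = L_K f_\rho, \qquad (L_K + \mu I) f_\mu = L_K f_\rho.$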
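Let us prove (20): using $\|L_K^{u-t}\| = \|L_K\|^{u-t} \le \kappa^{2(u-t)}$ by (3), and $\max(t, 0) + \min(0, t) = t$,

$\|J_{u,\lambda,\mu}\| \le |\lambda - \mu| \, \|(L_K + \lambda I)^{-1 + \max(t,0)} (L_K + \mu I)^{\min(t,0)} L_K^{u-t}\|$

$\quad \le |\lambda - \mu| \, \lambda^{-1 + \max(t,0)} \mu^{\min(t,0)} \|L_K^{u-t}\| \le \kappa^{2(u-t)} |\lambda - \mu| \, \lambda^{-1} \max(\lambda^t, \mu^t)$

$\quad = \kappa^{2(u-t)} \Lambda(\mu/\lambda) \, |\lambda^t - \mu^t|,$

where

$\Lambda(m) := \frac{(1 - m) \max(1, m^t)}{|1 - m^t|}.$

Now

$\Lambda(\mu/\lambda) \le \frac{1}{|t|}.$

Indeed, if $t > 0$, then this is a consequence of $x \le (1 - (1 - x)^t)/t$ applied to $x := 1 - \mu/\lambda$, using
that $x \mapsto (1 - (1 - x)^t)/t$ (defined on $(-\infty, 1]$) is convex and thus remains above the tangent line
at 0. Similarly, we use $x \le (1 - (1 - x)^{-t})/(-t)$ if $t < 0$.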

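Now (20)-(21) implies (A) with $u := r$ and $t := r$, and (B) with $u := s$ and $t := 1$, since
$L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$ implies $L_K^{-s} f_\rho \in \mathscr{L}^2_{\rho_X}$ for any $s \le r$. Similarly, (D) (resp. (E)) follows from

$L_K^{-1/2} (f_\lambda - f_\mu) = J_{r - 1/2, \lambda, \mu} L_K^{-r} f_\rho,$

and (20) applied to $u := r - 1/2$ and $t := u$ (resp. $u := s - 1/2$ and $t := 1$).

Let us now prove (C): if $r \ge 1/2$, then $f_\rho \in \mathscr{H}_K$, and the first part of equality (21) implies

$\|f_\lambda - f_\mu\|_K \le |\mu - \lambda| \, \|(L_K + \lambda I)^{-1}\| \, \|(L_K + \mu I)^{-1} L_K\| \, \|L_K^{-1/2} f_\rho\|_\rho \le \frac{|\lambda - \mu|}{\lambda} \|f_\rho\|_K.$

□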
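By the spectral theorem, the operator bound (20) amounts to a scalar inequality over the spectrum of $L_K$: for every eigenvalue $\sigma \in (0, \kappa^2]$, $|\lambda - \mu| \, \sigma^{1+u} / ((\sigma + \lambda)(\sigma + \mu)) \le \kappa^{2(u-t)} |\lambda^t - \mu^t| / |t|$. The Python sketch below samples parameters at random and reports the smallest observed gap between the two sides. The value of $\kappa$, the sampling ranges and the function names are assumptions made for illustration.

```python
import random


def drift_gap(sigma, lam, mu, u, t, kappa):
    """Right-hand side of (20) minus the spectral symbol of J_{u,lam,mu} at eigenvalue sigma."""
    symbol = abs(lam - mu) * sigma ** (1.0 + u) / ((sigma + lam) * (sigma + mu))
    bound = kappa ** (2.0 * (u - t)) * abs(lam ** t - mu ** t) / abs(t)
    return bound - symbol


def min_gap(trials=5000, kappa=1.5, seed=0):
    """Smallest gap found over random draws with t in [-1, 1] \\ {0}, u >= t and 0 < mu < lam."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        t = rng.choice([-1.0, -0.5, -0.1, 0.1, 0.5, 1.0])
        u = t + rng.uniform(0.0, 1.0)            # any u >= t (hence u >= -1)
        lam = rng.uniform(0.01, 2.0)
        mu = lam * rng.uniform(0.05, 0.95)       # 0 < mu < lam
        sigma = rng.uniform(1e-6, kappa ** 2)    # eigenvalues of L_K lie in (0, kappa^2]
        worst = min(worst, drift_gap(sigma, lam, mu, u, t, kappa))
    return worst
```

A nonnegative value of `min_gap()` is consistent with (20); such a random search is only a sanity check and does not replace the convexity argument of the proof.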
5. Upper Bounds for Convergence in $\mathscr{H}_K$

Throughout this section, we assume that $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$ for some $r \in (1/2, 3/2]$, which implies
$f_\rho \in \mathscr{H}_K$ with additional regularity, and assume that the sequences $(\gamma_t)_{t \in \mathbb{N}}$ and $(\lambda_t)_{t \in \mathbb{N}}$ are chosen
as in (17).

Our goal is to provide a probabilistic upper bound for

$\|f_t - f_\rho\|_K,$

in order to prove Theorem B. We start with the triangle inequality

$\|f_t - f_\rho\|_K \le \|f_t - f_{\lambda_t}\|_K + \|f_{\lambda_t} - f_\rho\|_K,$

and apply the reversed martingale decomposition of $(f_t)_{t \in \mathbb{N}}$ developed in Section 3, Theorem 3.1:

(22) $r_t = \Pi_1^t r_0 - \sum_{j=1}^{t} \gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j) - \sum_{j=1}^{t} \Pi_j^t \Delta_j.$

We make use of the corresponding notation of Section 3, in particular Section 3.3, so that

$A_t \bar w_t - b_t = (L_t + \lambda_t I) f_{\lambda_t} - y_t K_{x_t},$

and

$$\Pi_j^t = \begin{cases} \prod_{i=j}^{t} (I - \gamma_i (L_i + \lambda_i I)) & \text{if } j \le t; \\ I & \text{otherwise.} \end{cases}$$

Now

$\|f_t - f_\rho\|_K \le \mathcal{E}_{init}(t) + \mathcal{E}_{approx}(t) + \mathcal{E}_{drift}(t) + \mathcal{E}_{samp}(t),$

where we define the errors as follows:

(A) Initial Error: $\mathcal{E}_{init}(t) = \|\Pi_1^t r_0\|_K$ comes from the initial choice $f_0$;

(B) Approximation Error: $\mathcal{E}_{approx}(t) = \|f_{\lambda_t} - f_\rho\|_K$ measures the distance between the regression
function and the regularization path at time $t$;

(C) Drift Error: $\mathcal{E}_{drift}(t) = \|\sum_{j=1}^{t} \Pi_j^t \Delta_j\|_K$ comes from the drift along the regularization path
$t \mapsto f_{\lambda_t}$;

(D) Sample Error: $\mathcal{E}_{samp}(t) = \|\sum_{j=1}^{t} \gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j)\|_K$, where $\xi_j = \gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j)$ is a
reversed martingale difference sequence, reflecting the random fluctuation caused by sampling.

In the remainder of this section, we are going to provide upper bounds for each of the four errors,
which, roughly speaking when $ab = 1$, are

$\mathcal{E}_{init}(t) = O(t^{-1}),$

$\mathcal{E}_{approx}(t) = O(t^{-(r - 1/2)(1 - \theta)}),$

$\mathcal{E}_{drift}(t) = O(t^{-(r - 1/2)(1 - \theta)}),$

$\mathcal{E}_{samp}(t) = O(t^{1/2 - \theta}).$

It is not surprising that the approximation error and drift error have the same rate, as both
of them come from the estimates on drifts in Theorem 4.1. This suggests our explanation that
the bias $= \mathcal{E}_{approx}(t) + \mathcal{E}_{drift}(t)$ and the variance $= \mathcal{E}_{samp}(t)$. Theorem B then follows from these
bounds by setting $\theta = 2r/(2r + 1)$.
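The choice $\theta = 2r/(2r+1)$ can be read as the point where the bias and variance exponents coincide. Taking the bias to decay like $t^{-(r - 1/2)(1 - \theta)}$ (as in Theorems 5.2 and 5.3) and the variance like $t^{-(\theta - 1/2)}$, the following Python sketch verifies the balance in exact rational arithmetic. The function names are hypothetical and the variance exponent is the one assumed in this paragraph.

```python
from fractions import Fraction


def balanced_theta(r):
    """theta = 2r/(2r+1), the choice made for Theorem B."""
    return 2 * r / (2 * r + 1)


def hk_exponents(r, theta):
    """Decay exponents in H_K norm: bias ~ t^(-bias_exp), variance ~ t^(-var_exp)."""
    bias_exp = (r - Fraction(1, 2)) * (1 - theta)
    var_exp = theta - Fraction(1, 2)
    return bias_exp, var_exp


if __name__ == "__main__":
    for r in (Fraction(3, 4), Fraction(1), Fraction(3, 2)):
        theta = balanced_theta(r)
        print(r, theta, hk_exponents(r, theta))
```

Both exponents then equal $(2r - 1)/(2(2r + 1))$; any other $\theta$ makes one of the two terms decay more slowly than this common rate.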
5.1. Initial Error.

Theorem 5.1 (Initial Error). Let $t_0^{\theta} \ge a(\kappa^2 + b)$. Then for all $t \in \mathbb{N}$,

$\mathcal{E}_{init}(t) \le B_3 t^{-ab},$

where $B_3 = (t_0 + 1)^{ab} \|r_0\|$.
Proof. 

^.nu{t) < limilllroll < (^)"'lko|| < (^)"'lko|| 
where the second last step uses Lemma 3.8 (B) with j = 1. □ 

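5.2. Approximation Error. The approximation error is derived from Theorem 4.1(D) by setting
$\lambda = \lambda_t$ and $\mu = 0$.

Theorem 5.2 (Approximation Error). For $r \in (1/2, 3/2]$ and $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$,

$\|f_{\lambda_t} - f_\rho\|_K \le B_1 b^{r - 1/2} t^{-(r - 1/2)(1 - \theta)},$

where $B_1 = (r - 1/2)^{-1} \|L_K^{-r} f_\rho\|_\rho$.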

5.3. Drift Error. 

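Theorem 5.3 (Drift Error). Let $t_0^{\theta} \ge [a(\kappa^2 + b) \vee 1]$. Then for $r \in (1/2, 3/2]$ and $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$,

$$\mathcal{E}_{drift}(t) \le \begin{cases} B_2 b^{r - 1/2} t^{-(r - 1/2)(1 - \theta)}, & \text{if } ab > (r - 1/2)(1 - \theta); \\ B_2 b^{r - 1/2} t^{-ab}, & \text{if } ab < (r - 1/2)(1 - \theta), \end{cases}$$

where $B_2 = \frac{4(1 - \theta)}{|ab - (r - 1/2)(1 - \theta)|} \|L_K^{-r} f_\rho\|_\rho$.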



Proof. We are going to provide an upper bound of 

t 

<^drift{t) = II n^Ajiix. 

First, Lemma 3.8 implies, using (16), that for all j, t G N, t > j. 



j + to 



ab 



P3) Iin5ll^>j + i 

if > a{K^ + b) (and, therefore, 740^ = + 7*/^^ < o6tQ ^ + aK^t^^ < 1). 
Second, by Theorem 4.1(D), 



K JpWp 
' 2 



(24) < 6'-"l/2(i_^)(^_i)-(r~l/2)(l-e)-l||^-r^^ 



K JPWP-I 






where we use 

|ApV2 _ ^ ^._l/2 |^-(.-l/2)(l-e) _ _ ^)-(.-l/2)(l-. 

< V~V\r - 1/2)(1 - 6){t - i)-('-i/2)(i-f?)-i^ 

due to the Mean Value Theorem with = 2;-('^-V2)(i-0) and/i'(x) = -(r-l/2)(l-6')x-(^-i/2)(i-e)-i ^ 
such that 

\ht-h{t- 1)1 = \h'{7])\ < \h'{t-l)\, for some i] G (t- Ij). 



Now combining (23) and (24) gives 

'^drifdt) = ||^n*A,|k<6^^-V2(i_0)l|L-7,||,.^(^^y\j + to-i)- 



ir-l/2){l-e)-l 



< ''^"''(l;!i!fi^^^"^ +^o)-^-^-^-^/-^(^-^). 

It suffices to bound 

V(j + torb-l-ir-l/m-e) < /■^'■'(^ ^ ^^^,b_i_(,_i/2)(l-e)^^ ^. 
i=l ^0 

Now, if ab> {r- 1/2) (1 - 0), then 

(t + l)afc-{r-l/2){l-0) 

^* - a6- (r- 1/2)(1 -0)' 
whereas ab < {r — 1/2) (1 — 9) imphes 

ab-{r-l/2){l-e) ^ 
J <; ''0 <; t f > 1 

*- (r-l/2)(l-e)-a5 - |a5-(r-l/2)(l-^)|' ' 



□ 



5.4. Sample Error. 

Theorem 5.4 (Sample Error). Assume that > [0(^2 + 6) V 6 V 1] and ah ^9 -1/2 or (36*- l)/2. 
Then, with probability at least 1 — 5 (6 £ {0, 1)), 

2(^+l)2M, 2 8kM, 2 

where = log - ana = — , log - . 

3 5 ^\ah-{9-l/2)\ S 

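The proof of Theorem 5.4 requires some auxiliary estimates.

Lemma 5.5. Let $A_t \bar w_t - b_t = (f_{\lambda_t}(x_t) - y_t) K_{x_t} + \lambda_t f_{\lambda_t}$.

(A) $\|A_t \bar w_t - b_t\|_K \le (\kappa + 1)^2 M_\rho / \sqrt{\lambda_t}$, if $t_0^{1 - \theta} \ge b$;

(B) $\mathbb{E}[\|A_t \bar w_t - b_t\|_K^2] \le 4 \kappa^2 M_\rho^2$.

Proof. (A) Using $\|f_\lambda\|_K \le M_\rho / \sqrt{\lambda}$ in Lemma B.1(A),

$\|A_t \bar w_t - b_t\|_K \le \|f_{\lambda_t}(x_t) K_{x_t}\|_K + |y_t| \, \|K_{x_t}\|_K + \lambda_t \|f_{\lambda_t}\|_K \le M_\rho \kappa^2 / \sqrt{\lambda_t} + M_\rho \kappa + M_\rho \sqrt{\lambda_t},$

since $\|f_{\lambda_t}(x_t) K_{x_t}\|_K = |\langle f_{\lambda_t}, K_{x_t} \rangle_K| \, \|K_{x_t}\|_K \le \|f_{\lambda_t}\|_K \|K_{x_t}\|_K^2 \le M_\rho \kappa^2 / \sqrt{\lambda_t}$. Now,

$M_\rho \kappa^2 / \sqrt{\lambda_t} + M_\rho \kappa + M_\rho \sqrt{\lambda_t} \le (\kappa^2 + \kappa + 1) M_\rho / \sqrt{\lambda_t} \le (\kappa + 1)^2 M_\rho / \sqrt{\lambda_t},$

where the second last inequality is due to $t_0^{1 - \theta} \ge b \implies \lambda_t \le 1$.

(B) Using $\lambda_t f_{\lambda_t} = L_K f_\rho - L_K f_{\lambda_t}$ we obtain

$(f_{\lambda_t}(x_t) - y_t) K_{x_t} + \lambda_t f_{\lambda_t} = (L_t - L_K) f_{\lambda_t} + L_K f_\rho - y_t K_{x_t},$

so that

$\mathbb{E}\|A_t \bar w_t - b_t\|_K^2 = \mathbb{E}\|(L_t - L_K) f_{\lambda_t} + L_K f_\rho - y_t K_{x_t}\|_K^2$

$\quad \le 2 \mathbb{E}[\|(L_t - L_K) f_{\lambda_t}\|_K^2 + \|L_K f_\rho - y_t K_{x_t}\|_K^2]$

$\quad \le 2 \mathbb{E}[\|L_t f_{\lambda_t}\|_K^2 + \|y_t K_{x_t}\|_K^2] \le 2 \kappa^2 (\|f_{\lambda_t}\|_\rho^2 + M_\rho^2) \le 4 \kappa^2 M_\rho^2,$

since $\mathbb{E}[L_t] = L_K$, $\mathbb{E}[y_t K_{x_t}] = L_K f_\rho$ and $\|f_\lambda\|_\rho \le M_\rho$ by Lemma B.1(B).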

□

Now we are ready to give the proof of the sample error bounds, Theorem 5.4.

Proof of Theorem 5.4. We are going to bound

$\mathcal{E}_{samp}(t) = \Big\| \sum_{j=1}^{t} \xi_j \Big\|_K,$

where $\xi_j = \gamma_j \Pi_{j+1}^t (A_j \bar w_j - b_j)$ is a reversed martingale difference sequence. To apply the
Pinelis-Bernstein inequality in Proposition A.3, we need bounds on $\|\xi_j\|_K$ and $\sum_{j=1}^{t} \mathbb{E}_{j+1}\|\xi_j\|_K^2$, where $\mathbb{E}_{j+1}[\cdot]$ is
the expectation conditional on examples after time $j$.

Notice that for > a(K^ + h) and j > 1, using 1 + x < for all x € M, 

ah _ / , \ab-e 



|7,-n 



■j+i I 



< 



(j+to)' 
where e is the Euler constant. 

Now Lemma 5.5 (B) implies 
Hence 



j + tQ + iy- ^ a{j + toY 



t + l 



{t + 1 



\ab 



+ 



ah 



E[\\AtWt-bt\\j,]<4K^Ml 



4e2(a/tMp)2(j + to) 



2ab-2e 



(t + l 



\2ab 



SO that, if to ^ 2 
(25) 



< < 



' 2e^{aKMpf .x-2(g-i/2) „^ ^ ^ i /o 



1/2) - ah 



(i + 1) 



if aft < - 1/2 



On the other hand, if tg ^ > 6, Lemma 5.5 (A) implies 



l^i^i - ^ilk < («^ + l)^p/\/Aj 
whence



ea{K + l)^Mp-^(3g^i)/2 



(26) < < 



^ . , if ab> {39 -l)/2 

if a6< (30- l)/2 



Vb 

The final bound is obtained by the Pinelis-Bernstein inequality in Proposition A.3 with upper bounds
(25) and (26). □

5.5. Proof of Theorem B. We choose $\theta = 2r/(2r + 1)$, $a > 1$, $b < 1$ such that $ab = 1$, and assume
> + 1. Using Theorems 5.2, 5.3, 5.1, and 5.4, 

Note that, by Lemma 5.5(A) with /o = 0, 

M 4r + 3 

Bs = (to + l)||r-o|| = (to + 1)||/aoII < Co := 2to^ = 2t^^+' Mp 
On the other hand, 

C, := B, + B2= (-^ + ^) ||L-7pllp = (2^_T)(2,^+3)ll^^^'/pllp 
and, using \/at'^^ < and to > 1 



T,,-e r^T, <r + ^ 8kM, 20(K + l)^Mp 
fi4to ^/^ + B5< + -j=<C2:= , 

which concludes the proof of Theorem B. 

6. Upper Bounds for Convergence in $\mathscr{L}^2_{\rho_X}$

Throughout this section, we assume that $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$ for some $r \in [1/2, 3/2]$, which implies
$f_\rho \in \mathscr{H}_K$ with additional regularity, and assume the sequences $(\gamma_t)_{t \in \mathbb{N}}$ and $(\lambda_t)_{t \in \mathbb{N}}$ are chosen as in
(17). Note that the case $r = 1/2$ is included here, whereas it was not in Section 5 and Theorem B.

Our goal is to provide a probabilistic upper bound of

$\|f_t - f_\rho\|_\rho,$

in order to prove Theorem C. As in Section 5, we start with the triangle inequality

$\|f_t - f_\rho\|_\rho \le \|f_t - f_{\lambda_t}\|_\rho + \|f_{\lambda_t} - f_\rho\|_\rho,$

but apply here the martingale decomposition of $(f_t)_{t \in \mathbb{N}_0}$ in $\mathscr{L}^2_{\rho_X}$ developed in Theorem 3.3 instead:

$r_t = \bar\Pi_1^t r_0 + \sum_{j=1}^{t} \gamma_j \bar\Pi_{j+1}^t \chi_j - \sum_{j=1}^{t} \bar\Pi_j^t \Delta_j.$

We make use of the corresponding notation of Section 3, in particular Section 3.3, so that

$\chi_t = (L_K - L_t) f_{t-1} + (y_t K_{x_t} - L_K f_\rho),$

and

(27) $$\bar\Pi_j^t = \begin{cases} \prod_{i=j}^{t} (I - \gamma_i (L_K + \lambda_i I)) & \text{if } j \le t; \\ I & \text{otherwise.} \end{cases}$$

The martingale decomposition enables us to make use of the isometry $L_K^{1/2} : \mathscr{L}^2_{\rho_X} \to \mathscr{H}_K$, in
the sense that one can benefit from the spectral decomposition of $L_K^{1/2} \bar\Pi_j^t$ to get a tighter estimate.
This was not possible with the reversed martingale decomposition, since $L_K^{1/2} \Pi_j^t$ does not have an
obvious spectral decomposition.

Note however that $\chi_t$ depends on $f_{t-1}$, so that we need preliminary estimates of $\|\chi_t\|_K$, provided
in Appendix B.

As in Section 5, we introduce the following definitions for convenience.
[Definitions of Errors]

(A) Initial Error: $\mathcal{E}_{init}(t) = \|\bar\Pi_1^t r_0\|_\rho$, which reflects the propagation error by the initial choice $f_0$;

(B) Approximation Error: $\mathcal{E}_{approx}(t) = \|f_{\lambda_t} - f_\rho\|_\rho$, which measures the distance between the
regression function and the regularization path at time $t$;

(C) Drift Error: $\mathcal{E}_{drift}(t) = \|\sum_{j=1}^{t} \bar\Pi_j^t \Delta_j\|_\rho$, which measures the error caused by drifts from $f_{\lambda_{j-1}}$
to $f_{\lambda_j}$ along the regularization path;

(D) Sample Error: $\mathcal{E}_{samp}(t) = \|\sum_{j=1}^{t} \gamma_j \bar\Pi_{j+1}^t \chi_j\|_\rho$, where $\chi_j$ is a martingale difference sequence,
reflecting the random fluctuation caused by sampling.

Our aim is to bound

$\|f_t - f_\rho\|_\rho \le \mathcal{E}_{init}(t) + \mathcal{E}_{samp}(t) + \mathcal{E}_{drift}(t) + \mathcal{E}_{approx}(t).$

In the remainder of this section, we are going to provide upper bounds for each of the four errors,
which, roughly speaking when $ab = 1$, are

$\mathcal{E}_{init}(t) = O(t^{-1}),$

$\mathcal{E}_{approx}(t) = O(t^{-r(1 - \theta)}),$

$\mathcal{E}_{drift}(t) = O(t^{-r(1 - \theta)}),$

$\mathcal{E}_{samp}(t) = O(t^{-\theta/2}).$

This suggests our explanation that the bias $= \mathcal{E}_{approx}(t) + \mathcal{E}_{drift}(t) = O(t^{-r(1 - \theta)})$ and the variance
$= \mathcal{E}_{samp}(t) = O(t^{-\theta/2})$, similar to the batch learning setting. Theorem C then follows from these
bounds by setting $\theta = 2r/(2r + 1)$.

6.1. Initial Error.

Theorem 6.1 (Initial Error). Let $t_0^{\theta} \ge a(\kappa^2 + b)$. Then for all $t \in \mathbb{N}$,

$\mathcal{E}_{init}(t) \le B_6 t^{-ab},$

where $B_6 = M_\rho (t_0 + 1)^{ab}$.



Proof. Lemma 3.8(B) with j = 1 and (16) imply that , if > a(K^ + b), 



ab 



5mit(t) < ||nj||||ro|| < ( j-pY ) Ikol 



^0 + 1 

For $f_0 = 0$, using Lemma B.1(B), $\|r_0\|_\rho = \|f_{\lambda_0}\|_\rho \le M_\rho$. □

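6.2. Approximation Error.

Theorem 6.2 (Approximation Error). For $r \in (0, 1]$ and $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$,

$\|f_{\lambda_t} - f_\rho\|_\rho \le B_7 b^{r} t^{-r(1 - \theta)},$

where $B_7 = r^{-1} \|L_K^{-r} f_\rho\|_\rho$.

Proof. Follows from Theorem 4.1(A) with $\lambda = \lambda_t$ and $\mu = 0$. □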

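6.3. Drift Error.

Theorem 6.3 (Drift Error). Assume $t_0^{\theta} \ge [a(\kappa^2 + b) \vee 1]$. Then, if $r \in (0, 1]$ and $L_K^{-r} f_\rho \in \mathscr{L}^2_{\rho_X}$,

$$\mathcal{E}_{drift}(t) \le \begin{cases} B_8 b^{r} t^{-r(1 - \theta)}, & \text{if } ab > r(1 - \theta); \\ B_8 b^{r} t^{-ab}, & \text{if } ab < r(1 - \theta), \end{cases}$$

where $B_8 = \frac{4(1 - \theta)}{|ab - r(1 - \theta)|} \|L_K^{-r} f_\rho\|_\rho$.

Proof. Similar to the proof of Theorem 5.3, replacing $r - 1/2$ with $r$. □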

6.4. Sample Error. In this section we assume $b = a^{-1}$ for simplicity; this is necessary for the
bounds in Appendix B, in particular Corollary B.7, and is enough to provide the optimal bounds
we need (see the discussion after the statement of Theorem A).

Theorem 6.4 (Sample Error). Assume that LjJ'fp G ^p^ for some r G [1/2,1], 9 G [1/2,2/3], 
a6 = 1, a > 4 and tQ>2 + SK^a. Then, for all t G N, with probability at least 1 — S (S G (0, 1)J 

^sa^pit) < V-aB,'-^ + (ay^B,,V^t + a^/^Bn) ^'Z'^^' 

where 



samp\-j ^ W _g/2 ' \^"' --iu v --6 - i ^iij _(36(-l)/2 ' 



Bq := WuMp, Bio ■= 63K^Mp, Bn := BOk"^ Mpt]^"^' 



Proof Fix t G N, (5 G [0, 1], and let 



(28) Au ■■= KMpO Uatl^^ ^ + 15^/hgt 



, 2 
log^. 



For all j G N, let us define the martingale difference sequence 
where we make use of the notation of Appendix B. Recall that Corollary B.7 implies that, with
probability at least $1 - \delta$, all the indicator function events for $1 \le j \le t$ hold, which will be assumed in
the computation below.



Recall that 



Xj = {Aj - Aj)wj-i + bj - bj = {Lk - Lj)fj-i + yjK^j - Lxfp 



where Lj := { ,K^^)K^.. 

Using Lemma B.3 and the decomposition fj = fp + gj + hj in Appendix B, we deduce that, for 
all 1 < j < if + to)'-'/' < At^s, 

< 3K2[E,-_i|y,- - + E,_i|<7j_i(a;,)|' + E,_i|/i,-_i(x,-)|'] 



Now, using the isometry l)^^ : ^p^ -^K^ 



l/2frt 



In order to estimate X]j=i < isH-'^j+i-^-^'-^i+ill' ^'^^all that (/Xq,, ^a)a&\ is an orthonormal eigen- 
system of Lk '■ -^pa^ — ^ -^pa: ■ ^« ~ ^^-^^ ^^Ma ^^r simplicity; then 

t t t 



t 



7i4-,t,<5 n (1 - «^ 



< sup 

Ma 



i=j + l 
t 



sup7j4t,5 n 



j=i i=j+i 
t 

n (1 ~ 

i=l j=j+l 



where for large enough to. 



sup7j4t,<5 n (1 - < sup7j44,5 (l-7iAi 

i=j+l i=j+l 



< 3aK sup 



2 j + to ^Mj 



+ 



< 3aK 



2 \ 
\ 



■361-1 
and

t t t t t 

^i^" n (1 - ^ 5^(1 - (1 - n (1 - ^*/^-) = 1 - 11(1 - 7,^,) < 1. 

j=l i=j+l j=l i=i+l i=l 

These two upper bounds give 



(29) 



Moreover, again if + ^o)^ ^ ^t,(5) then, using Lemma B.3 (B) and Corollary B.7, 

we deduce 

hjKxj - Ljfj_i\\K = WvjK^j - Ljifp + gj^i + hj_i)\\K 

< KM, + ^ + + to)'/'-' =: C,,,s, 

which implies 

WXjWK = hjKxj - Ljfj^i - EjlyjK^^ - Ljfj_i]\\K < 2Cj^t,5- 



Therefore 



t 

< 2Ksup^jCj^t,5 f|(l-7iAj), 11^x11 < 



Tt l|l/2 



< 2aK sup ■ 



j + to ( Mp uMp^ Hi A 



+ 



+ 



< 2aK^ 



i t \{3+tof (i + to)(3^-l)/2 (^• + i^)2^^-l/2 

Mp KMp^/a KAt^s 



< 



t 



■e/2 



KMpa + At,s 



where we use to ^ '^^ twice in the last inequality. 
Combining M and at from (29), we obtain 



2|- + a. 



2\faK 



^ +-\m, + k 



(^/3 + l/3)At,5 + KMpa/3 



^-1/2 



(3e-i)/2 ■ 



where we use that 



= IOkA/p > 2k(V15 + 2/3)Mp, 
BiQ = m^'Mp > k2Mp[30(\/3 + 1/3) + 2/(31og2)], 
Bii = BOK^Mptl^^'^ > 2AK'Mptl/^~\V3 + l/3). 



□ 
6.5. Proof of Theorem C. We choose $\theta = 2r/(2r + 1)$, $a > 1$, $b < 1$ such that $ab = 1$, and assume
d > aK^ + 1. Using Theorems, 6.2, 6.3, 6.1, and 6.4 



ft ~ fpWp ^ ^init{t) + S'approx{t) + <^drift{t) + 'S'sarnp{t) 

T 

< ^+ ({Bj + B8)a-' + V^Bg log 1^ (l) + (a^/^BioV^ + a'^^Bn 



t y ' ■ ' " ^ 6 J \t J 

This enables us to conclude, with Dq := 2MptQ > Bq = Mp{tQ + 1), 

Di := Bj + Bs = —rr-—::\\L~^^fp\\p, 



_ 6r-l ) 



-(1 + r)' 



1)2 := -Bg, := B^, and 1)4 := ^n. 



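Appendix A: A Probabilistic Inequality

The following result is quoted from [Theorem 3.4 in Pinelis 1994].

Lemma A.1 (Pinelis-Bennett). Let $\xi_i$ be a martingale difference sequence in a Hilbert space.
Suppose that almost surely $\|\xi_i\| \le M$ and $\sum_{i=1}^{t} \mathbb{E}_{i-1}\|\xi_i\|^2 \le \sigma_t^2$. Then

$\mathrm{Prob}\Big\{ \sup_{1 \le k \le t} \Big\| \sum_{i=1}^{k} \xi_i \Big\| \ge \epsilon \Big\} \le 2 \exp\Big\{ -\frac{\sigma_t^2}{M^2} \, g\Big( \frac{M \epsilon}{\sigma_t^2} \Big) \Big\},$

where $g(x) = (1 + x) \log(1 + x) - x$ for $x > 0$.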



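Using the lower bound $g(x) \ge \frac{x^2}{2(1 + x/3)}$, we obtain the following generalized Bernstein's
inequality.

Corollary A.2 (Pinelis-Bernstein). Let $\xi_i$ be a martingale difference sequence in a Hilbert space.
Suppose that almost surely $\|\xi_i\| \le M$ and $\sum_{i=1}^{t} \mathbb{E}_{i-1}\|\xi_i\|^2 \le \sigma_t^2$. Then

(A-1) $\mathrm{Prob}\Big\{ \sup_{1 \le k \le t} \Big\| \sum_{i=1}^{k} \xi_i \Big\| \ge \epsilon \Big\} \le 2 \exp\Big\{ -\frac{\epsilon^2}{2(\sigma_t^2 + M \epsilon / 3)} \Big\}.$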



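The following result will be used as a basic probabilistic inequality to derive various bounds.

Proposition A.3. Let $\xi_i$ be a martingale difference sequence in a Hilbert space. Suppose that
almost surely $\|\xi_i\| \le M$ and $\sum_{i=1}^{t} \mathbb{E}_{i-1}\|\xi_i\|^2 \le \sigma_t^2$. Then the following holds with probability at
least $1 - \delta$ ($\delta \in (0, 1)$),

$\sup_{1 \le k \le t} \Big\| \sum_{i=1}^{k} \xi_i \Big\| \le 2 \Big( \frac{M}{3} + \sigma_t \Big) \log\frac{2}{\delta}.$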



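Proof. Taking the right hand side of (A-1) to be $\delta$, we arrive at the following quadratic equation
for $\epsilon$,

$\epsilon^2 - \frac{2M}{3} \epsilon \log\frac{2}{\delta} - 2 \sigma_t^2 \log\frac{2}{\delta} = 0.$

Note that $\epsilon > 0$, then

$\epsilon = \frac{1}{2} \Big\{ \frac{2M}{3} \log\frac{2}{\delta} + \sqrt{ \frac{4M^2}{9} \log^2\frac{2}{\delta} + 8 \sigma_t^2 \log\frac{2}{\delta} } \Big\}$

$\quad = \frac{M}{3} \log\frac{2}{\delta} + \sqrt{ \Big( \frac{M}{3} \Big)^2 \log^2\frac{2}{\delta} + 2 \sigma_t^2 \log\frac{2}{\delta} }$

$\quad \le \frac{2M}{3} \log\frac{2}{\delta} + \sqrt{ 2 \sigma_t^2 \log\frac{2}{\delta} },$

where the last step is due to $\sqrt{a^2 + b^2} \le a + b$ ($a, b > 0$) with

$a = \frac{M}{3} \log\frac{2}{\delta}, \qquad b = \sqrt{ 2 \sigma_t^2 \log\frac{2}{\delta} }.$

We complete the proof by relaxing $\sqrt{2 \sigma_t^2 \log 2/\delta} \le 2 \sigma_t \log 2/\delta$ since $2 \log 2/\delta > 1$ for $\delta \in (0, 1)$. □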
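The computation in this proof can be reproduced numerically. The Python sketch below returns the positive root $\epsilon$ of the quadratic equation, the value of the right-hand side of (A-1) at that root (which should equal $\delta$), and the relaxed radius $2(M/3 + \sigma_t)\log(2/\delta)$ of Proposition A.3. The function name and the sample values of $M$, $\sigma_t$ and $\delta$ are assumptions made for illustration.

```python
import math


def bernstein_radius(M, sigma, delta):
    """Return (eps, tail, relaxed) for the Pinelis-Bernstein bound (A-1).

    eps     : positive root of eps^2 - (2M/3) L eps - 2 sigma^2 L = 0, with L = log(2/delta)
    tail    : 2 exp(-eps^2 / (2 (sigma^2 + M eps / 3))), the right-hand side of (A-1) at eps
    relaxed : 2 (M/3 + sigma) L, the simpler radius of Proposition A.3
    """
    L = math.log(2.0 / delta)
    eps = (M / 3.0) * L + math.sqrt((M / 3.0) ** 2 * L ** 2 + 2.0 * sigma ** 2 * L)
    tail = 2.0 * math.exp(-eps ** 2 / (2.0 * (sigma ** 2 + M * eps / 3.0)))
    relaxed = 2.0 * (M / 3.0 + sigma) * L
    return eps, tail, relaxed


if __name__ == "__main__":
    print(bernstein_radius(M=1.0, sigma=0.5, delta=0.05))
```

The relaxed radius is never smaller than the exact root, which is the price paid for a bound that is linear in $\log(2/\delta)$.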



Appendix B: Preliminary Upper Bounds 

Appendix B is devoted to the proof of preliminary upper bounds on the online learning sequence
$(f_t)_{t \in \mathbb{N}}$ defined in (6), and on the regularization path $\lambda \mapsto f_\lambda$. We make use of the notation of
Section 3, in particular Section 3.3. For simplicity we assume $f_0 := 0$; note that another choice
would correspond to adding $\Pi_1^t f_0$ to $f_t$ at time $t$. We assume that the sequences $(\gamma_t)_{t \in \mathbb{N}}$ and $(\lambda_t)_{t \in \mathbb{N}}$
are chosen as in (17).

Firstly, Lemmas B.1 and B.2 provide deterministic upper bounds. Then the rest of the Appendix
aims at obtaining probabilistic bounds of $(f_t)_{t \in \mathbb{N}}$, based on a decomposition of $f_t - f_\rho$ into two parts
in (B-2): $g_t$ is purely deterministic and is upper bounded in $\mathscr{L}^2_{\rho_X}$-norm in Lemma B.3, and $h_t$ is
studied in detail in Lemmas B.4 and following. Corollary B.7 yields logarithmic estimates with large
probability.

Lemma B.1. For any $\lambda > 0$,

(A) $\|f_\lambda\|_K \le M_\rho / \sqrt{\lambda}$;

(B) $\|f_\lambda\|_\rho \le M_\rho$.

Proof. (A) By definition,

$f_\lambda = \arg\min_{f \in \mathscr{H}_K} \|f - f_\rho\|_\rho^2 + \lambda \|f\|_K^2.$

The term we minimize on the right-hand side takes the value $\|f_\rho\|_\rho^2$ at $f = 0$, so that

(B-1) $\|f_\lambda - f_\rho\|_\rho^2 + \lambda \|f_\lambda\|_K^2 \le \|f_\rho\|_\rho^2 \le M_\rho^2,$

which yields the result.

(B) Using (5),

$\|f_\lambda\|_\rho = \|(L_K + \lambda I)^{-1} L_K f_\rho\|_\rho \le \|(L_K + \lambda I)^{-1} L_K\| \, \|f_\rho\|_\rho \le \|f_\rho\|_\rho \le M_\rho.$

□
Lemma B.2. Assume $t_0^{\theta} \ge a(\kappa^2 + b)$. Then, for all $t \in \mathbb{N}$,



\\ft\\K< 



Proof. Recall that ft = {I — 'ytAt)ft-i + ItUtKxf Now assume ^ + 1- using (18), 

\\ft\\K < \\l - ltAt\\\\ft-l\\K + IthtK^^WK < {l-lt\t)\\ft~l\\K+ltKMp. 

By induction on t, we deduce 

t t ^ t t ^ 



since 



j=i t=j+i i=i «=i+i 



E^^-^^- n (1 -7»A.)= 1-11(1 -^^^^)- 



□ 



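In the rest of the Appendix, we prove probabilistic bounds of $(f_t)_{t \in \mathbb{N}_0}$. First observe that the
definition of the online learning sequence (6) can be rewritten as

$f_t - f_\rho = [I - \gamma_t (L_t + \lambda_t I)](f_{t-1} - f_\rho) + \gamma_t (y_t K_{x_t} - L_t f_\rho) - \gamma_t \lambda_t f_\rho.$

Let us now define the following $(\mathcal{F}_t)$-adapted processes $(g_t)_{t \in \mathbb{N}_0}$ and $(h_t)_{t \in \mathbb{N}_0}$ recursively by

$g_0 := -f_\rho, \qquad h_0 := 0,$

and

$g_t := [I - \gamma_t (L_K + \lambda_t I)] g_{t-1} - \gamma_t \lambda_t f_\rho,$

$h_t := [I - \gamma_t (L_t + \lambda_t I)] h_{t-1} + \gamma_t (y_t K_{x_t} - L_t f_\rho) + \gamma_t (L_K - L_t) g_{t-1}.$

We can easily prove by induction that

(B-2) $f_t - f_\rho = g_t + h_t,$

using $f_0 = 0$.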

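Lemma B.3. Assume $t_0^{\theta} \ge a(\kappa^2 + b)$. Then, for all $t \in \mathbb{N}_0$,

(A) $\|g_t\|_\rho \le M_\rho$;

(B) $\|g_t + f_\rho\|_K \le 3 M_\rho / \sqrt{\lambda_t}$.

Proof. We prove (A) by induction: $\|g_0\|_\rho = \|f_\rho\|_\rho \le M_\rho$ and, for all $t \in \mathbb{N}$, if we assume $\|g_{t-1}\|_\rho \le M_\rho$
then, using (19),

$\|g_t\|_\rho \le \|I - \gamma_t (L_K + \lambda_t I)\| \, \|g_{t-1}\|_\rho + \gamma_t \lambda_t \|f_\rho\|_\rho \le (1 - \gamma_t \lambda_t) \|g_{t-1}\|_\rho + \gamma_t \lambda_t M_\rho \le M_\rho.$

To prove (B), observe that, for all $t \in \mathbb{N}$,

$g_t + f_\rho = [I - \gamma_t (L_K + \lambda_t I)] g_{t-1} + (1 - \gamma_t \lambda_t) f_\rho = [I - \gamma_t (L_K + \lambda_t I)](g_{t-1} + f_\rho) + \gamma_t L_K f_\rho$

$\quad = [I - \gamma_t (L_K + \lambda_t I)](g_{t-1} + f_\rho) + \gamma_t (L_K + \lambda_t I) f_{\lambda_t},$

so that

$g_t + f_\rho - f_{\lambda_t} = [I - \gamma_t (L_K + \lambda_t I)](g_{t-1} + f_\rho - f_{\lambda_t}).$

Let, for all $t \in \mathbb{N}$,

$\omega_t := g_t + f_\rho - f_{\lambda_t}.$

Then it is easy to show by induction that

$\omega_t = \bar\Pi_1^t \omega_0 + \sum_{k=1}^{t} \bar\Pi_k^t (f_{\lambda_{k-1}} - f_{\lambda_k}),$

which implies, using Theorem 4.1 (D) with $r = 0$, and Lemma B.1 (A) ($\omega_0 = -f_{\lambda_0}$), that

$\|\omega_t\|_K \le 2 M_\rho / \sqrt{\lambda_t}.$

This enables us to conclude, using again Lemma B.1 (A).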

For all t G No and M G M+ U {oo}, let 

■= '^{\ht-i{xt)\<M}Lt, Lt := l{iht-i(xt)\>M}Lt, 
LK:=Et^i[Lt], LK:=Et.i[Lt]. 

Note that Lt = Lt + Lt and Lk = Lk + Lk- 
For all $t \in \mathbb{N}$, let

(B-3) ht := [I - ^tiLt + Xtl)]ht-i + itiytK^^ - Ltfp) + jt{LK - Lt)gt-i = ht + jtLth-i 
(B-4) kt := ht-{l- lt\)ht-i = -it[-Ltht-i + {vtK.^, - Ltfp) + {Lk - Lt)gt-i] 
(B-5) = 7t[-Lt/it_i + ytK^^ + LKQt-i - Lt{fp + gt-i)]- 

In Lemma B.4 we upper bound in conditional expectation; note that the result still holds 

when M = oo. We threshold ht into ht in order to limit its conditional variance, which will be 
necessary in order to obtain logarithmic estimates with large probability in Lemma B.7, using on 
the other hand Lemma B.6 showing that, if M is large enough, ||/it||A' < ll^tlli^- 

Lemma B.4. Assume > 2a{b + k^). For all M G M+ U {oo} and t G N, 

^t-iWMW] < (1 - IthfWht^ifK + s^'M^^^. 
In particular, assume moreover that 9 > 1/2, e := ah — {6 — 1/2) > and to > max(2a6, 2e, e + 

. —1/2 

(20 — l)/e), and let A := aKMp^S/e. Then \\ht-i\\K ^ -^t implies 

t-^'^¥.t-l[\\ht\\K] < {t-lf-^'^\\ht-l\\K. 

Proof. For all t G N, let 

Ct ■■= {Lk - Lt)ht-i + {Lk - Lt)gt^i + {ytK^t - Ltfp), 

so that 

ht = [I- 7t{LK + At/)]/it„i + jtCt- 

Using Ef_i[Ct] = 0, we deduce that 
(B-6) E^_i 017^^111,] = \\[I-jt{LK + XtI)]ht-ifK + l^Et-i[\\CtfK]. 



-f\o) that 



□ 
Let us now upper bound the two summands in the right-hand side of equality (B-6). First,
\\[I - jt(LK + XtI)]ht^ifK 

= (1 - jAt?\\ht-i\\l - 27t(l - ltXt)^t-i[\ht-i{xt)Wih,M-t)\<M}] 

+j^\\Et.i[Ltht.i]\\j„ 

and 

\\Et^i[Ltht^i]\\l < (Et_i[|/it„i(xt)|l||,,^_^(,^)|<M}||i^,,||K])' 

< K'^Et^i[\ht^i{xt)\'^l{iht-^(^xt)\<M}], 

using conditional Jensen's inequality.

Second, using that E[ytK;j;^ - Ltfp \ a{Tt-i,xt)] = 0, 

Et-iiWCtWl] = Et-iilKlx - Lt)ht^i + {Lk - Lt)gt-ifK + WvtKx, - Lt/pWl] 

< Et^iiWLtht^i + Ltgt^iWl + \\ytK,,\\l] 

< (2Ei_i[|/ii_i(xi)|'l{|^,^,(.oi<A/}] + 3M2) , 
where we use Lemma B.3 (A) in the last inequality. 

In summary, we obtain that 

^t-i[\\ht\\K] < (1 - 7tAt)2||/it-i||i - 7t(2 - 2jtXt - 37tK2)Et_i[|/i4_i(xj)|2l{|;,^_^(,^)l<A,}] 

Now, the assumption > 2a{b + k^) implies 2 — 27jAt — S^tK^ > for all t G N, which completes 
the proof of the first statement. 



Let us now prove the second statement: 



11-112 



(B-7) 



<(l-i 



1-26* 



Now, since to ^ max(2o6, 2e, e + {20 — l)/e) and 9 > 1/2, we have 

1-29 / , \ 2 



log 



1-1 

t 



t J 



t 



< -= + <0. 

t 72 



using log(l — x) < —X for all x G [0, 1] and log(l — x) > —x — x"^ for all x G [0, 1/2]. 
Therefore (B-7) implies 

At < -^\\ht-if + ?,K^MlaH-^^ < 0. 
The conclusion follows by conditional Jensen's inequality. 
Lemma B.5. Assume tf^ > a{K^ + b) and tl~^ > 6(2 + MM'^); then 

't\\K < 2KMpa6-¥-'^ an(i Et_i[||A:t||^] < Ot'M^k^. 



□ 
1/2

Proof. By definition (B-5), using ||i^a:tlk < WLrIpWr < k||L^ fpWx = K\\fp\\p < uMp and 
Lemma B.3 (A)-(B), we deduce 

Mp\ 2k7jM„ 2KMpab-^ 



\\kt\\K < It [k(M + 2Mp) + \\Lt{fp + gt-i)\\K] < t^lt + 2Mp + ^J< - 

where we use $t_0^{1 - \theta} \ge b(2 + M M_\rho^{-1})$ in the last inequality.
Now, using (B-4), we obtain 

^t-i[\\kt\\l] < 37i' [Ei_i[||Ii/ii_i||i-] +Et_i[||yiK,Ji.] +Ei_i[||Lt5t-i|||]] 
< H[^MIk^ + \\gt-i\\pK^] < '^l^tMy. 



□ 

Lemma B.6. For all t G N, assume M > Mt := 'inMpab^H^^'^'^ , > 2a{K^ + h) and tl'^ > 
h{2 + MMp^); then 

\\ht\\K < 



Proof. Assume ht-i{xt) > Mt for instance; the other case is similar. By definition, 

ht = ht- jtht-i{xt)K^^ 

so that 

ll^illi^ = WhtWlc - 2jtht-i{xt)htixt) + -ft{ht-iixt)fK{xt,xt) < \\ht\\K 
^iht{xt) > n'^-itht-i{xt)/2. 

But, using Lemma B.5, 
ht{xt) = (1 - -ftXt)ht-i{xt) + kt{xt) > (1 - 7jAt)/it„i(xt) - 2KMpab' t > k 'jtht-i{xt)/2 
if 

2KMpab^H^-^^ < ht^i{xt)/2 < ht^i{xt){l - jtXt - ^^7t/2), 
since the assumption Iq > 2a{K? + b) implies 1 — 7fAt — K^7f/2 > 1/2. □ 



The following logarithmic upper bound holds under the assumptions ab— {9— 1/2) > 0, 6 £ [1/2, 1] 
and to sufficiently large, but we assume b = in its statement, for notational reasons. 



Corollary B.7. Assume 6 £ [1/2, 1], b 
at least 1 — 6, 



^0 — ^ + Sk^o and ti ^ > 45. Then, with probability 



sup \\hk\\Kik + tQ + lf-'^/'^ < HiMp 

0<k<t 



Uatl^^ %15v^logt 



1 2 
log^. 



Proof. Let us first check that the assumptions of Lemmas B.4, B.5 and B.6 are satisfied, and apply 
these lemmas: tf^ > 3 + SnPa > 2a{K^ + b). Now e = ab - {0 - 1/2) G [1/2, 1], and the hypothesis 
to > max(2a6, 2e, e + {29 — l)/e) is satisfied as long to ^ 3, which is assumed here. We choose 
M = Mt = AnMpob-H^''^^] now t^ > Sko and tj~^ > 46 imply tj"^ > 6(2 + AKab-Hl-"^^) > 
6(2 + MiM-i). 
For all $i \in \mathbb{N}$, if $\|h_{i-1}\|_K \ge A(i + t_0)^{1/2-\theta}$, $A := a\kappa M_\rho\sqrt{3/\varepsilon}$, then



i\\K < \\hi\\K, (Lemma B. 6) 

< \\hi\\K +E^-l{\\h^\\K) -^i-l{\\hi\\K) 



< 1 



i + to 



-ill/r + ei, (Lemma B. 4) 



where 



satisfies 



ei := \\h^\\K -^i^ii\\hi\\K 



\ei\\K < iKMpab-\i + to)^-^'> and Ei_i[||e,f ] < djfM^^K^ 



Let, for all $i \in \mathbb{N}$,

$$\eta_i := \sum_{k=1}^{i} e_k\,(k + t_0)^{\theta-1/2}\,\mathbf{1}_{\{\|h_{k-1}\|_K \ge A(k+t_0)^{1/2-\theta}\}}.$$



Fix $t \in \mathbb{N}$. For all $0 \le i < t$, $\|\eta_{i+1} - \eta_i\| \le 4\kappa M_\rho a^2 t_0^{1/2-\theta}$, and



k=l 



k=l 



Let 



A := { sup \\r]i\\ < luMpU 



Aat, 



1/2- 



3 + 



l<i<t 

By Proposition A.3, $\mathbb{P}(A) \ge 1 - \delta$.

Now assume A holds. Let, for all $k \in \mathbb{N}$,

$$X_k := \|h_k\|_K\,(k + t_0 + 1)^{\theta-1/2}.$$



log- 



For all $k \le t$, let

$$m := \max\{j \le k : \|h_j\|_K \le A(j + t_0 + 1)^{1/2-\theta}\}.$$



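If $m < k$, then

$$X_{m+1} \le \left[A(m + t_0 + 1)^{1/2-\theta} + 2\kappa M_\rho a^2 (m + t_0 + 1)^{1-2\theta}\right](m + t_0 + 2)^{\theta-1/2}$$

$$\le \frac{\sqrt{5}}{2}\left[A + 2\kappa M_\rho a^2 t_0^{1/2-\theta}\right] \le \frac{\sqrt{5}}{2}\left[\sqrt{6}\,a\kappa M_\rho + 2\kappa M_\rho a^2 t_0^{1/2-\theta}\right];$$

the second inequality comes from $[(m + t_0 + 2)/(m + t_0 + 1)]^{\theta-1/2} \le \sqrt{5/4}$,

since $t_0 \ge 3$.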
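The ratio bound invoked here can be checked numerically. The following Python sketch evaluates $[(m + t_0 + 2)/(m + t_0 + 1)]^{\theta - 1/2}$ on a grid; the grid ranges, and the reading of the bound as $\sqrt{5/4}$ for $m \ge 0$, $t_0 \ge 3$ and $\theta \in [1/2, 1]$, are assumptions made for illustration.

```python
import math

def ratio_power(m, t0, theta):
    """[(m + t0 + 2) / (m + t0 + 1)] ** (theta - 1/2)."""
    return ((m + t0 + 2.0) / (m + t0 + 1.0)) ** (theta - 0.5)

def max_ratio_power(m_values, t0_values, theta_values):
    """Largest value of ratio_power over a finite grid of parameters."""
    return max(
        ratio_power(m, t0, theta)
        for m in m_values
        for t0 in t0_values
        for theta in theta_values
    )

# Hypothetical grid: m >= 0, t0 >= 3, theta in [1/2, 1].
thetas = [0.5 + 0.05 * i for i in range(11)]
worst = max_ratio_power(range(0, 51), range(3, 21), thetas)
bound = math.sqrt(5.0 / 4.0)
```

On this grid the largest value is attained at the corner $m = 0$, $t_0 = 3$, $\theta = 1$, where the ratio equals $5/4$ and the exponent equals $1/2$.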

On the other hand it is easy to prove by induction that, for all $k \le t$,

$$X_k \le X_{m+1} + \eta_k - \eta_{m+1}$$

and, therefore,



Xfc < KMptt 



r- 16\ i/2-e V30 / / t 

+ — I at]!' + ^ + 12Wlog I 1 + 



log- 



at, 



< nMpa 



1/2- 


1/2-0 



log' 



120*0^ + lb^/\ogi 



log^, 

using in the last inequality that, for all t > 1 and tQ>2^ 

1 



- + Vlog(l + t/to) < - + Vlog(t + to) < -y%g{t + to). 



□ 



Appendix C: Proof of Results of Section 3.2 

Proof of Lemma 3.8. Assume $t \ge j_0$. The spectral theorem for compact operators implies that
there is an orthonormal basis of W consisting of eigenvectors of $A_t$, so that, if $(\alpha_{t,k})_{k \in \mathbb{N}}$ are the
eigenvalues of $A_t$, then



-iii-i 



rcimat,k > a*, 11-^ - ltAt\\ = max(l - jtat,k) > 



k>l 



k>l 



where we use that, for all /c € N, ^tOLt,k < ItOit < 1. 

But minfc>i at± > implies maxfc>i(l - 'jtat^k) < 1 - Tto*, thus (A). 
The last claim follows from the inequality 
t / \ / t 



n 1 



i + to 



<exp\-Y, 



' i + to 



< exp —clog 



t + l 
j + to 



j + tp 
t + l 



□ 
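The last claim of this proof rests on two elementary facts: $1 - x \le e^{-x}$, and $\sum_{i=j}^{t} 1/(i + t_0) \ge \log\big((t + t_0 + 1)/(j + t_0)\big)$ by comparison with an integral. The Python sketch below checks the resulting chain of bounds numerically. The index range $i = j, \dots, t$ and the values of `c`, `t0`, `j`, `t` are assumptions made for illustration and may differ from the exact indexing of the lemma.

```python
import math

def product_bound_terms(c, t0, j, t):
    """Return (product, exponential bound, power bound) for
    prod_{i=j}^{t} (1 - c / (i + t0)), assuming 0 <= c / (i + t0) <= 1."""
    prod = 1.0
    harmonic = 0.0
    for i in range(j, t + 1):
        prod *= 1.0 - c / (i + t0)
        harmonic += 1.0 / (i + t0)
    exp_bound = math.exp(-c * harmonic)
    power_bound = ((j + t0) / (t + t0 + 1.0)) ** c
    return prod, exp_bound, power_bound

# Hypothetical parameter values (not from the paper).
prod, exp_bound, power_bound = product_bound_terms(c=0.75, t0=3.0, j=1, t=50)
```

The three returned values are ordered: the product is at most the exponential bound, which is at most the polynomial rate $((j + t_0)/(t + t_0 + 1))^c$.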

Proof of Theorem 3.5. First, ^t^t — ?• implies that there exists jo G N such that jti^t ^ 1 for all 
t > jo- Hence Lemma 3.8 (B) applies, so that 



(C-1) 



|n*||<n(i-7.^ 



Let us use the reversed martingale decomposition of $r_t$, from times $j_0$ to $t$:

$$\|r_t\| \le \mathcal{E}_{init}(t) + \mathcal{E}_{samp}(t) + \mathcal{E}_{drift}(t),$$

where 



i=io+i i=io+i 
Now, by (C-1),



^.ic^inititf) < exp -2 ^ia, E(||rjJ|2) -^t^^ 

\ *=io+i / 

since Ylt'^t-^t — and 

t n 

^drift {t)< ||A,|| JJ(l-7iaJ ^t^ooO 

j=jo+i «=i 

by assumption (C). Now consider the sample error. Using the independence of $(z_t)_{t \in \mathbb{N}}$,

2 



IE(<^samp(i)^) 



E 



j=jo+l 



j=jo+l 
t t 

i=io+i i=i+i 

where C := sup^gj^ E||y4t?D( — < oo by assumption. This completes the proof, using (C). □ 

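Proof of Lemma 3.6. Let $\varepsilon > 0$. The assumptions $\limsup_{t \to \infty} a_t/b_t = 0$ and $b_t \to 0$ as $t \to \infty$ imply that
there exists $N \in \mathbb{N}$ such that $a_t \le \varepsilon b_t/2$ and $b_t \le 1$ for all $t \ge N$. On the other hand, $\sum_{t \in \mathbb{N}} b_t = \infty$
implies that there exists $N_1 \in \mathbb{N}$ such that, for all $n \ge N_1$,

$$\sum_{k=1}^{N} a_k \prod_{i=k+1}^{n} (1 - b_i) \le \frac{\varepsilon}{2}.$$

Now

$$\sum_{k=N+1}^{n} a_k \prod_{i=k+1}^{n} (1 - b_i) \le \frac{\varepsilon}{2} \sum_{k=N+1}^{n} b_k \prod_{i=k+1}^{n} (1 - b_i),$$

and we can write the right-hand side of this last inequality as a telescopic sum, i.e.

$$\sum_{k=N+1}^{n} b_k \prod_{i=k+1}^{n} (1 - b_i) = \sum_{k=N+1}^{n} \left[\prod_{i=k+1}^{n} (1 - b_i) - \prod_{i=k}^{n} (1 - b_i)\right] = 1 - \prod_{i=N+1}^{n} (1 - b_i) \le 1,$$

which enables us to conclude. □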
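The telescoping step can be verified numerically. The Python sketch below computes $\sum_{k} a_k \prod_{i=k+1}^{n} (1 - b_i)$ directly and compares it, for $a = b$, with $1 - \prod_{i=N+1}^{n} (1 - b_i)$. The sequences $b_t = 1/(t+1)$ and $a_t = (t+1)^{-3/2}$ are illustrative assumptions chosen so that $a_t/b_t \to 0$, $b_t \to 0$ and $\sum_t b_t = \infty$; they are not taken from the paper.

```python
def weighted_tail_sum(a, b, start, n):
    """sum_{k=start}^{n} a[k] * prod_{i=k+1}^{n} (1 - b[i]), with 1-indexed lists."""
    total = 0.0
    prod = 1.0  # empty product, corresponding to k = n
    for k in range(n, start - 1, -1):
        total += a[k] * prod
        prod *= 1.0 - b[k]  # now equals prod_{i=k}^{n} (1 - b[i])
    return total

def tail_product(b, start, n):
    """prod_{i=start}^{n} (1 - b[i]), with 1-indexed lists."""
    prod = 1.0
    for i in range(start, n + 1):
        prod *= 1.0 - b[i]
    return prod

def make_sequences(n):
    """Hypothetical sequences; index 0 is an unused placeholder."""
    b = [0.0] + [1.0 / (t + 1) for t in range(1, n + 1)]
    a = [0.0] + [(t + 1) ** -1.5 for t in range(1, n + 1)]
    return a, b

N, n = 5, 200
a, b = make_sequences(n)
telescoped = weighted_tail_sum(b, b, N + 1, n)
closed_form = 1.0 - tail_product(b, N + 1, n)

# Illustration of the lemma's conclusion: the full weighted sum shrinks as n grows.
a_small, b_small = make_sequences(100)
a_large, b_large = make_sequences(10000)
value_small = weighted_tail_sum(a_small, b_small, 1, 100)
value_large = weighted_tail_sum(a_large, b_large, 1, 10000)
```

With these sequences the product has the closed form $\prod_{i=N+1}^{n} (1 - b_i) = (N+1)/(n+1)$, which gives an independent check of the telescoped value.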